Parallel Real-Time Scheduling of DAGs


Washington University in St. Louis
Washington University Open Scholarship
All Computer Science and Engineering Research — Computer Science and Engineering

Report Number: WUCSE-2013-5 (2013)

Parallel Real-Time Scheduling of DAGs

Authors: Abusayeed Saifullah, David Ferry, Jing Li, Kunal Agrawal, Chenyang Lu, and Christopher Gill

Recently, multi-core processors have become mainstream in processor design. To take full advantage of multi-core processing, computation-intensive real-time systems must exploit intra-task parallelism. In this paper, we address the open problem of real-time scheduling for a general model of deterministic parallel tasks, where each task is represented as a directed acyclic graph (DAG) with nodes having arbitrary execution requirements. We prove processor-speed augmentation bounds for both preemptive and non-preemptive real-time scheduling for general DAG tasks on multi-core processors. We first decompose each DAG into sequential tasks with their own release times and deadlines. Then we prove that these decomposed tasks can be scheduled using preemptive global EDF with a resource augmentation bound of 4. This bound is as good as the best known bound for more restrictive models, and is the first for a general DAG model. We also prove that the decomposition has a resource augmentation bound of 4 plus a non-preemption overhead for non-preemptive global EDF scheduling. To our knowledge, this is the first resource augmentation bound for non-preemptive scheduling of parallel... Read complete abstract on page 2.

Follow this and additional works at: http://openscholarship.wustl.edu/cse_research
Part of the Computer Engineering Commons, and the Computer Sciences Commons

Recommended Citation: Saifullah, Abusayeed; Ferry, David; Li, Jing; Agrawal, Kunal; Lu, Chenyang; and Gill, Christopher, "Parallel Real-Time Scheduling of DAGs" Report Number: WUCSE-2013-5 (2013). All Computer Science and Engineering Research. http://openscholarship.wustl.edu/cse_research/101

Department of Computer Science & Engineering - Washington University in St. Louis
Campus Box 1045 - St. Louis, MO - 63130 - ph: (314) 935-6160.

Parallel Real-Time Scheduling of DAGs

Complete Abstract: Recently, multi-core processors have become mainstream in processor design. To take full advantage of multi-core processing, computation-intensive real-time systems must exploit intra-task parallelism. In this paper, we address the open problem of real-time scheduling for a general model of deterministic parallel tasks, where each task is represented as a directed acyclic graph (DAG) with nodes having arbitrary execution requirements. We prove processor-speed augmentation bounds for both preemptive and non-preemptive real-time scheduling for general DAG tasks on multi-core processors. We first decompose each DAG into sequential tasks with their own release times and deadlines. Then we prove that these decomposed tasks can be scheduled using preemptive global EDF with a resource augmentation bound of 4. This bound is as good as the best known bound for more restrictive models, and is the first for a general DAG model. We also prove that the decomposition has a resource augmentation bound of 4 plus a non-preemption overhead for non-preemptive global EDF scheduling. To our knowledge, this is the first resource augmentation bound for non-preemptive scheduling of parallel tasks. Finally, we evaluate our analytical results through simulations that demonstrate that the derived bounds are safe, and reasonably tight in practice, especially under preemptive EDF scheduling.

This technical report is available at Washington University Open Scholarship: http://openscholarship.wustl.edu/cse_research/101

Parallel Real-Time Scheduling of DAGs

Abusayeed Saifullah, David Ferry, Jing Li, Kunal Agrawal, Chenyang Lu, and Christopher Gill
Department of Computer Science and Engineering
Washington University in St. Louis

Abstract—Recently, multi-core processors have become mainstream in processor design. To take full advantage of multi-core processing, computation-intensive real-time systems must exploit intra-task parallelism. In this paper, we address the open problem of real-time scheduling for a general model of deterministic parallel tasks, where each task is represented as a directed acyclic graph (DAG) with nodes having arbitrary execution requirements. We prove processor-speed augmentation bounds for both preemptive and non-preemptive real-time scheduling for general DAG tasks on multi-core processors. We first decompose each DAG into sequential tasks with their own release times and deadlines. Then we prove that these decomposed tasks can be scheduled using preemptive global EDF with a resource augmentation bound of 4. This bound is as good as the best known bound for more restrictive models, and is the first for a general DAG model. We also prove that the decomposition has a resource augmentation bound of 4 plus a non-preemption overhead for non-preemptive global EDF scheduling. To our knowledge, this is the first resource augmentation bound for non-preemptive scheduling of parallel tasks. Finally, we evaluate our analytical results through simulations that demonstrate that the derived bounds are safe, and reasonably tight in practice, especially under preemptive EDF scheduling.

I. INTRODUCTION

As the rate of increase of clock frequencies is leveling off, most processor chip manufacturers have recently moved to increasing performance by increasing the number of cores on a chip. Intel's 80-core Polaris [1], Tilera's 100-core TILE-Gx, AMD's 12-core Opteron [2], and ClearSpeed's 96-core processor [3] are some notable examples of multi-core chips.
With the rapid evolution of multi-core technology, however, real-time system software and programming models have failed to keep pace. Most classic results in real-time scheduling concentrate on sequential tasks running on multiple processors [4]. While these systems allow many tasks to execute on the same multi-core host, they do not allow an individual task to run any faster on it than on a single-core machine. If we want to scale the capabilities of individual tasks with the number of cores, it is essential to develop new approaches for tasks with intra-task parallelism, where each real-time task itself is a parallel task that can utilize multiple cores at the same time. Such intra-task parallelism may enable timing guarantees for complex real-time systems that require heavy computation, such as video surveillance, computer vision, radar tracking, and hybrid real-time structural testing [5], whose stringent timing constraints are difficult to meet on traditional single-core processors.

There has been some recent work on real-time scheduling for parallel tasks, but it has been mostly restricted to the synchronous task model [6], [7]. In the synchronous model, each task consists of a sequence of segments with synchronization points at the end of each segment. In addition, each segment of a task contains threads of execution that are of equal length. For synchronous tasks, the result in [6] proves a resource augmentation bound of 4 under global earliest deadline first (EDF) scheduling. A resource augmentation bound under a scheduling policy quantifies the processor speed-up factor (how much we have to increase the speed) with respect to an optimal algorithm needed to guarantee the schedulability of a task set. While the synchronous task model represents the kind of tasks generated by the parallel for loop construct that is common to many parallel languages such as OpenMP [8] and CilkPlus [9], most parallel languages also have other constructs for generating parallel programs, notably fork-join constructs.
A program that uses fork-join constructs will generate a non-synchronous task, generally represented as a directed acyclic graph (DAG), where each thread (sequence of instructions) is a node, and the edges represent dependences between the threads. A node's execution requirement can vary arbitrarily, and different nodes in the same DAG can have different execution requirements.

Another limitation of the state of the art is that all prior work on parallel real-time tasks considers preemptive scheduling, where threads are allowed to preempt each other in the middle of execution. While this is a reasonable model, preemption can be a high-overhead operation, since it often involves a system call and a context switch. An alternative scheduling model is to consider node-level non-preemptive scheduling (simply called non-preemptive scheduling in this paper), where once the execution of a particular node (thread) starts, it cannot be preempted by any other thread. Most parallel languages and libraries have yield points at the end of threads (nodes of the DAG), allowing low-cost, user-space preemption at these yield points. For these languages and libraries, schedulers that switch context only when threads end (in other words, where threads do not preempt each other) can be implemented entirely in user space (without interaction with the kernel), and therefore have low overheads. In addition, fewer switches usually imply lower caching overhead. In this model, since a node is never preempted, if it accesses the same memory location multiple times, those memory locations will be cached, and a node never has to restart on a cold cache.

This paper addresses the hard real-time scheduling problem of a set of generalized DAGs sharing a multi-core machine. We generalize the previous work in two important directions. First, we consider a general model of deterministic parallel tasks,

where each task is represented by a general DAG in which nodes can have arbitrary execution requirements. Second, we address both preemptive and non-preemptive scheduling. In particular, we make the following new contributions.

- We propose a novel task decomposition to transform the nodes of a general DAG into sequential tasks. Since each node of the DAG is transformed into a single sequential subtask, these subtasks can be scheduled either preemptively or non-preemptively.
- We prove that any set of parallel tasks of a general DAG model, upon decomposition, can be scheduled using preemptive global EDF with a resource augmentation bound of 4. This bound is as good as the best known bound for more restrictive models [6] and, to our knowledge, is the first bound for a general DAG model.
- We prove that our decomposition requires a resource augmentation bound of 4+ρ for non-preemptive global EDF scheduling, where ρ is the non-preemption overhead of the tasks. To our knowledge, this is the first bound for non-preemptive scheduling of parallel real-time tasks.
- We implement the proposed decomposition algorithm, and evaluate our analytical results for both preemptive and non-preemptive scheduling through simulations. The results indicate that the derived bounds are safe, and reasonably tight in practice, especially under preemptive EDF, which requires a resource augmentation of 3.2 in simulation as opposed to our analytical bound of 4.

Section II reviews related work. Section III describes the task model. Section IV presents the decomposition algorithm. Sections V and VI present analyses for preemptive and non-preemptive global EDF scheduling, respectively. Section VII presents the simulation results. Section VIII offers conclusions.

II. RELATED WORK

There has been a substantial amount of work on traditional multiprocessor real-time scheduling focused on sequential tasks [4]. Scheduling of parallel tasks without deadlines has been addressed in [10]–[15].
Soft real-time scheduling (where the goal is to meet a subset of deadlines based on application-specific criteria) has been studied for various parallel task models and optimization criteria such as cache misses [16], [17], makespan [18], and total work done within deadlines [19]. Schedulability analysis under hard real-time constraints (where the goal is to meet all task deadlines) is intractable for most cases of parallel tasks without resource augmentation [20]. Some early work makes simplifying assumptions about task models [21]–[24]. For example, some approaches [21], [22] address the scheduling of malleable tasks, where tasks can execute on varying numbers of processors without loss in efficiency. The study in [23] considers non-preemptive EDF scheduling of moldable tasks, where the actual number of processors used by a particular task is determined before starting the system, and remains unchanged. Gang EDF scheduling [24] of moldable parallel tasks requires the users to select (at submission time) a fixed number of processors upon which their task will run, and the task must then always use that number of threads.

Recently, preemptive real-time scheduling has been studied [6], [7] for synchronous parallel tasks with implicit deadlines. In [7], every task is an alternating sequence of parallel and sequential segments, with each parallel segment consisting of multiple threads of equal length that synchronize at the end of the segment. All parallel segments in a task have an equal number of threads, which cannot exceed the number of processor cores. Each thread is transformed into a subtask, and a resource augmentation bound of 3.42 is claimed under partitioned Deadline Monotonic (DM) scheduling. This result was later generalized for the synchronous model with arbitrary numbers of threads in segments, with bounds of 4 and 5 for global EDF and partitioned DM scheduling, respectively [6], and also to minimize the required number of processors [25].
Our earlier work [6] has proposed a simple extension to a synchronous task scheduling approach that handles unit-node DAGs, where each node has a unit execution requirement, by converting each task to a synchronous task, allowing direct application of the same approach. This model is quite restrictive and over-simplified, since the unit execution requirement of every node or thread simplifies the analysis for resource augmentation. However, these assumptions do not hold in general, since this model does not represent the parallel tasks that most parallel languages generate. Most parallel languages that use fork-join constructs generate non-synchronous tasks, generally represented as DAGs, where each node's execution requirement can vary arbitrarily, and different nodes in the same DAG can have different execution requirements. Notably, the decomposition in [6] for the restrictive model is not applicable to a general DAG. If one applies it anyway, a single node will split into multiple smaller subtasks, each with its own release time and deadline. As a result, when the decomposed tasks are scheduled, there is no easy way of preserving the node-level non-preemptive behavior of the original tasks. Scheduling and analysis of general DAGs thus poses a challenging open problem. For this general model, an augmentation bound has been analyzed recently in [26], but it considers the restricted case of a single DAG on a multi-core machine with preemption. In this paper, we investigate the open problem of scheduling and analysis for a set of any number of general DAGs on a multi-core machine. We consider both preemptive and non-preemptive real-time scheduling of general DAG tasks on multi-core processors, and provide resource augmentation bounds under both policies.

III. PARALLEL TASK MODEL

We consider n periodic parallel tasks to be scheduled on a multi-core platform consisting of m identical cores. The task set is represented by τ = {τ_1, τ_2, …, τ_n}.
Each task τ_i, 1 ≤ i ≤ n, is represented as a Directed Acyclic Graph (DAG), where the nodes stand for different execution requirements, and the edges represent dependences between the nodes. A node in τ_i is denoted by W_i^j, 1 ≤ j ≤ n_i, with n_i being the total number of nodes in τ_i.

Fig. 1. A parallel task τ_i represented as a DAG

The execution requirement of node W_i^j is denoted by E_i^j. A directed edge from node W_i^j to node W_i^k, denoted as W_i^j → W_i^k, implies that the execution of W_i^k cannot start unless W_i^j has finished execution. W_i^j, in this case, is called a parent of W_i^k, while W_i^k is its child. A node may have 0 or more parents or children. A node can start execution only after all of its parents have finished execution. Figure 1 shows a task τ_i with n_i = 10 nodes.

The execution requirement (i.e., work) C_i of task τ_i is the sum of the execution requirements of all nodes in τ_i; that is, C_i = Σ_{j=1}^{n_i} E_i^j. Thus, C_i is the maximum execution time of τ_i if it were executing on a single processor of speed 1. For task τ_i, the critical path length, denoted by P_i, is the sum of the execution requirements of the nodes on a critical path. A critical path is a directed path that has the maximum execution requirement among all paths in DAG τ_i. Thus, P_i is the minimum execution time of τ_i, meaning that τ_i needs at least P_i time units on unit-speed processor cores even when the number of cores m is infinite.

The period of task τ_i is denoted by T_i, and the deadline D_i of each task τ_i is considered implicit, i.e., D_i = T_i. Since P_i is the minimum execution time of task τ_i even on a machine with an infinite number of cores, the condition T_i ≥ P_i must hold for τ_i to be schedulable (i.e., to meet its deadline). A task set is said to be schedulable when all tasks in the set meet their deadlines.

IV. TASK DECOMPOSITION

We schedule parallel tasks by decomposing them into smaller sequential tasks. This strategy allows us to leverage existing schedulability analysis for traditional multiprocessor scheduling (both preemptive and non-preemptive) of sequential tasks. In this section, we present a decomposition technique for a parallel task under a general DAG model. Upon decomposition, each node of a DAG becomes an individual sequential task, called a subtask, with its own deadline and with an execution requirement equal to the node's execution requirement. (Henceforth, we will use the terms subtask and node interchangeably.)
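As an illustration of the model quantities above, the work C_i, the critical-path length P_i, and each node's earliest start/finishing time on infinitely many cores (used to build τ_i^∞ below) can all be computed by one topological traversal. The sketch below is not part of the paper; the edge set is a hypothetical DAG chosen only to be consistent with the example values used later (C_i = 30, P_i = 14), since Figure 1 is not reproduced here.

```python
from collections import deque

def dag_timing(exec_req, edges):
    """exec_req: {node: E_i^j}; edges: iterable of (parent, child) pairs.
    Returns (C, P, start, finish), where start/finish give each node's
    earliest start/finishing time on infinitely many unit-speed cores."""
    children = {v: [] for v in exec_req}
    indeg = dict.fromkeys(exec_req, 0)
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
    start = dict.fromkeys(exec_req, 0)   # nodes with no parents start at time 0
    finish = {}
    queue = deque(v for v in exec_req if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        finish[u] = start[u] + exec_req[u]
        for v in children[u]:
            # a node's earliest start is the latest finishing time of its parents
            start[v] = max(start[v], finish[u])
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    C = sum(exec_req.values())           # work: total execution requirement
    P = max(finish.values())             # critical-path length: longest path
    return C, P, start, finish

# Hypothetical DAG using the example execution requirements of Section IV.
E = {1: 4, 2: 2, 3: 4, 4: 5, 5: 3, 6: 4, 7: 2, 8: 4, 9: 1, 10: 1}
edges = [(1, 4), (2, 4), (2, 5), (3, 6), (5, 7), (7, 8), (4, 8),
         (6, 9), (8, 10), (9, 10)]
C, P, start, finish = dag_timing(E, edges)
print(C, P)  # prints: 30 14
```

Kahn's algorithm visits each node only after all its parents, so `start[v]` is final when `v` is dequeued; this matches the earliest-start rule stated for τ_i^∞ in Section IV-B.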
All nodes of a DAG are assigned appropriate deadlines and release offsets such that, when they execute as individual subtasks, all dependences among them in the original DAG task are preserved. Thus, an implicit deadline DAG is decomposed into a set of constrained deadline (i.e., deadline no greater than period) sequential subtasks, with each subtask corresponding to a node of the DAG.

Our schedulability analysis for parallel tasks entails deriving a resource augmentation bound [6], [7]. In particular, our result aims at establishing the following claim: if an optimal algorithm can schedule a task set on a machine of m unit-speed processor cores, then our algorithm can schedule this task set on m processor cores, each of speed ν, where ν is the resource augmentation factor. Since an optimal algorithm is unknown, we pessimistically assume that an optimal scheduler can schedule a task set if each task of the set has a critical-path length no greater than its deadline, and the total utilization of the task set is no greater than m. Note that no algorithm can schedule a task set that does not meet these conditions. Our resource augmentation analysis is based on the densities of the decomposed tasks, where the density of any task is the ratio of its execution requirement to its deadline. We first present terminology used in decomposition. Then, we present the proposed technique for decomposition, followed by a density analysis of the decomposed tasks.

A. Terminology

Our proposed decomposition technique converts each implicit deadline DAG task into a set of constrained deadline sequential tasks, and is based on the following definitions, which are applicable for any task, not limited to just parallel tasks. The utilization u_i of any task τ_i, and the total utilization u_sum(τ) for any task set τ consisting of n tasks, are defined as

u_i = C_i / T_i;    u_sum(τ) = Σ_{i=1}^{n} C_i / T_i

If the total utilization u_sum is greater than m, then no algorithm can schedule τ on m identical unit-speed processor cores.
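The two necessary conditions assumed of an optimal scheduler above (critical-path length no greater than the deadline, and total utilization no greater than m) can be checked directly. A minimal sketch, with a hypothetical task representation of (C_i, P_i, T_i) tuples:

```python
def may_be_feasible(tasks, m):
    """Necessary conditions only: tasks is a list of (C_i, P_i, T_i) tuples,
    m is the number of identical unit-speed cores. Returns False when no
    algorithm can schedule the set; True means the set merely passes the
    two necessary checks, not that it is schedulable."""
    if any(P > T for (C, P, T) in tasks):
        return False  # some task cannot finish by its deadline even on infinitely many cores
    return sum(C / T for (C, P, T) in tasks) <= m  # u_sum must not exceed platform capacity
```

For instance, a single task with C_i = 30, P_i = 14, T_i = 21 passes on m = 2 cores (u_sum ≈ 1.43) but fails on m = 1.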
The density δ_i of any task τ_i, and the total density δ_sum(τ) and the maximum density δ_max(τ) for any set τ of n tasks, are defined as follows.

δ_i = C_i / D_i;    δ_sum(τ) = Σ_{i=1}^{n} δ_i;    δ_max(τ) = max{δ_i | 1 ≤ i ≤ n}    (1)

The demand bound function (DBF) of task τ_i is the largest cumulative execution requirement of all jobs generated by τ_i that have both arrival times and deadlines within a contiguous interval of t time units. For any task τ_i, the DBF is given by

DBF(τ_i, t) = max( 0, ( ⌊(t − D_i)/T_i⌋ + 1 ) C_i )    (2)

Based on the DBF, the load, denoted by λ(τ), of any task set τ consisting of n tasks is defined as follows.

λ(τ) = max_{t>0} ( Σ_{i=1}^{n} DBF(τ_i, t) ) / t    (3)

B. Decomposition Algorithm

The decomposition algorithm converts each node of a DAG into an individual sequential subtask with its own execution requirement, release offset, and a constrained deadline. The release offsets are assigned so as to preserve the dependences

" (a) τ : a tmng dagram for when τ executes on an nfnte number of processor cores (b) τ syn Fg.. τ and τ syn of DAG τ (of Fgure 1) of the orgnal DAG, namely, to ensure that a node (subtask) can start after the deadlnes of all the parent nodes (subtasks). That s, a node starts after ts latest parent fnshes. The (relatve) deadlnes of the nodes are assgned by splttng the task deadlne nto ntermedate subdeadlnes. The ntermedate subdeadlne assgned to a node s called node deadlne. Note that once task τ s released, t has a total of T tme unts to fnsh ts executon. The proposed decomposton algorthm splts ths deadlne T nto node deadlnes by preservng the dependences n τ. For task τ, the deadlne and the offset assgned to node W j are denoted by D j and Φj, respectvely. Once approprate values of D j and Φ j are determned for each node W j (respectng the dependences n the DAG), task τ s decomposed nto nodes. Upon decomposton, the dependences n the DAG need not be consdered, and each node can execute as a tradtonal sequental multprocessor task. Hence, the decomposton technque for τ bols down to determnng D j and Φj for each node W j as presented below. The presentaton s accompaned by an example usng the DAG τ from Fgure 1. For the example, we assgn executon requrement of each node W j as follows: E 1 =4, E =, E 3 =4, E4 =5, E5 =3, E6 =4, E7 =, E8 =4, E9 =1, E 10 =1. Hence, C =30, P =14. Let perod T =1. To perform the decomposton, we frst represent DAG τ as a tmng dagram τ (Fgure (a)) that shows ts executon tme on an nfnte number of unt-speed processor cores. Specfcally, τ ndcates the earlest start tme and the earlest fnshng tme (of the worst case executon requrement) of each node when m =. For any node W j that has no parents, the earlest start tme and the earlest fnshng tme are 0 and E j, respectvely. For every other node W j, the earlest start tme s the latest fnshng tme among ts parents, and the earlest fnshng tme s E j tme unts after that. 
For example, in τ_i^∞ of Figure 2(a), nodes W_i^1, W_i^2, and W_i^3 can start execution at time 0, and their earliest finishing times are 4, 2, and 4, respectively. Node W_i^4 can start after W_i^1 and W_i^2 complete, and finishes after 5 time units at its earliest, and so on. Figure 2(a) shows τ_i^∞ for DAG τ_i.

Next, based on τ_i^∞, the calculation of D_i^j and Φ_i^j for each node W_i^j involves the following two steps. In Step 1, for each node, we estimate the time requirement at different parts of the node. In Step 2, the total estimated time requirement over the different parts of the node is assigned as the node's deadline. As stated before, we analyze the schedulability of the decomposed tasks based on their densities. The efficiency of the analysis is largely dependent on the total density (δ_sum) and the maximum density (δ_max) of the decomposed tasks. Namely, we need to keep both δ_sum and δ_max bounded and as small as possible (since a higher density implies a higher ratio of execution requirement to deadline) to minimize the resource augmentation requirement. Therefore, the objective of the decomposition algorithm is to split the entire task deadline into node deadlines so that each node (subtask) has enough slack. The slack of any task represents the extra time beyond its execution requirement, and is defined as the difference between its deadline and its execution requirement.

1) Estimating Time Requirements of the Nodes: In DAG τ_i, a node can execute with different numbers of nodes in parallel at different times. Such a degree of parallelism can be estimated based on τ_i^∞. For example, in Figure 2(a), node W_i^5 executes with W_i^1 and W_i^3 in parallel for the first 2 time units, and then executes with W_i^4 in parallel for the next time unit. In this way, we first identify the degrees of parallelism

at different parts of each node. Intuitively, the parts of a node that may execute with a large number of nodes in parallel demand more time. Therefore, different parts of a node are assigned different amounts of time considering these degrees of parallelism and execution requirements. Later, the total time over all parts of a node is assigned to the node as its deadline.

To identify the degree of parallelism for different portions of a node based on τ_i^∞, we assign time units to a node in different (consecutive) segments. In different segments of a node, the task may have different degrees of parallelism. In τ_i^∞, starting from the beginning, we draw a vertical line at every time instant where a node starts or ends (as shown in Figure 2(b)). This is done in linear time using a breadth-first search over the DAG. The vertical lines now split τ_i^∞ into segments. For example, in Figure 2(b), τ_i^∞ is split into 7 segments (numbered in increasing order from left to right). Once τ_i^∞ is split into segments, each segment consists of an equal amount of execution by the nodes that lie in the segment. Parts of different nodes in the same segment can now be thought of as threads of execution that run in parallel, and the threads in a segment can start only after those in the preceding segment finish. We denote this synchronous form of τ_i^∞ by τ_i^syn. We first allot time to the segments, and finally add all times allotted to the different segments of a node to calculate its deadline. Note that τ_i is never converted to a synchronous model; the procedure only identifies segments to estimate the time requirements of nodes, and does not decompose τ_i in this step.

We split the T_i time units among the nodes based on the number of threads and the execution requirement of the segments where a node lies in τ_i^syn. We first estimate the time requirement for each segment. Let τ_i^syn be a sequence of s_i segments numbered 1, 2, …, s_i. For any segment j, we use m_i^j to denote the number of threads in the segment, and e_i^j to denote the execution requirement of each thread in the segment (see Figure 2(b)).
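The segmentation step above can be sketched as follows: cut the timeline of τ_i^∞ at every instant where some node starts or ends, and record each segment's thread length e_i^j and thread count m_i^j. The interval data below is hypothetical (a stand-in for Figure 2(a), consistent with C_i = 30 and P_i = 14), and the quadratic scan is for clarity rather than the linear-time construction described in the text.

```python
def segments(intervals):
    """intervals: {node: (start, finish)} from the tau_i^infinity diagram.
    Cuts the timeline at every start/finish instant and returns, per
    segment in order, a pair (e_j, m_j): segment length and thread count."""
    cuts = sorted({t for s, f in intervals.values() for t in (s, f)})
    segs = []
    for left, right in zip(cuts, cuts[1:]):
        # a node contributes a thread to this segment if it spans it entirely
        m = sum(1 for s, f in intervals.values() if s <= left and right <= f)
        segs.append((right - left, m))
    return segs

# Hypothetical earliest start/finish times for the 10-node example.
intervals = {1: (0, 4), 2: (0, 2), 3: (0, 4), 4: (4, 9), 5: (2, 5),
             6: (4, 8), 7: (5, 7), 8: (9, 13), 9: (8, 9), 10: (13, 14)}
segs = segments(intervals)
```

Two sanity identities from the text hold by construction: the segment lengths sum to the critical-path length, Σ_j e_i^j = P_i, and the per-segment work sums to the total work, Σ_j m_i^j e_i^j = C_i.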
Since τ_i^syn has the same critical path and total execution requirement as τ_i,

P_i = Σ_{j=1}^{s_i} e_i^j;    C_i = Σ_{j=1}^{s_i} m_i^j e_i^j

For any segment j of τ_i^syn, we calculate a value d_i^j, called the segment deadline, so that the segment is assigned a total of d_i^j time units to finish all its threads. We choose the value d_i^j so as to minimize both thread density and segment density, which in turn minimizes δ_sum and δ_max upon decomposition.

Since segment j consists of m_i^j parallel threads, with each thread having an execution requirement of e_i^j, the total execution requirement of segment j is m_i^j e_i^j. Thus, the segments with larger numbers of threads and with longer threads are computation-intensive, and demand more time to finish execution. Therefore, a seemingly reasonable way to assign the segment deadlines is to split T_i proportionally among the segments by considering their total execution requirements. Such a policy assigns a segment deadline of (T_i / C_i) m_i^j e_i^j to segment j. Since this is the deadline for each parallel thread of segment j, by Equation 1, the density of a thread becomes C_i / (m_i^j T_i), which can be as large as m (i.e., the total number of processor cores). Hence, such a method does not minimize δ_max, and is not useful.

Instead, we classify the segments of τ_i^syn into two groups based on a threshold θ_i on the number of threads per segment: each segment j with m_i^j > θ_i is classified as a heavy segment, and each segment j with m_i^j ≤ θ_i is classified as a light segment. Among the heavy segments, we allocate a portion of the time T_i that is no less than that allocated among the light segments. Before assigning time among the segments, an important issue is to determine the value of θ_i and the fraction of the time T_i to be split among the heavy and light segments. We show below that choosing θ_i = C_i / (T_i − P_i/2) helps us keep both thread density and segment density bounded. Therefore, each segment j with m_i^j > C_i / (T_i − P_i/2) is classified as a heavy segment, while the other segments are called light segments. Let H_i denote the set of heavy segments, and L_i denote the set of light segments of τ_i^syn.
This raises three different cases: L_i = ∅ (i.e., τ_i^syn consists of only heavy segments), H_i = ∅ (i.e., τ_i^syn consists of only light segments), and H_i ≠ ∅, L_i ≠ ∅ (i.e., τ_i^syn consists of both light and heavy segments). We use a different approach for each scenario.

Case 1: H_i = ∅. Since each segment has a small number (≤ C_i/(2(T_i − P_i/2))) of threads, we only consider the length of a thread in each segment when assigning time to it. Hence, T_i time units are split proportionally among all segments according to the length of each thread. For each segment j, its deadline d_i^j is calculated as follows:

  d_i^j = (T_i/P_i) · e_i^j    (4)

Since the condition T_i ≥ P_i must hold for every task τ_i,

  d_i^j = (T_i/P_i) · e_i^j ≥ (T_i/T_i) · e_i^j = e_i^j    (5)

Hence, the maximum density of a thread in any segment is at most 1. Each segment has at most C_i/(2(T_i − P_i/2)) threads. Hence, the total density of a segment is at most

  C_i/(2(T_i − P_i/2)) ≤ C_i/(2(T_i − T_i/2)) = C_i/T_i    (6)

Case 2: L_i = ∅. All segments are heavy, and T_i time units are split proportionally among all segments according to the work (i.e., total execution requirement) of each segment. For each segment j, its deadline d_i^j is given by

  d_i^j = (T_i/C_i) · m_i^j · e_i^j    (7)

Since m_i^j > C_i/(2(T_i − P_i/2)) for every segment j, we have

  d_i^j = (T_i/C_i) · m_i^j · e_i^j > (T_i/C_i) · (C_i/(2(T_i − P_i/2))) · e_i^j = (T_i/(2(T_i − P_i/2))) · e_i^j ≥ e_i^j/2    (8)

Hence, the maximum density of any thread is at most 2. The total density of segment j is at most

  (m_i^j · e_i^j)/d_i^j = (m_i^j · e_i^j)/((T_i/C_i) · m_i^j · e_i^j) = C_i/T_i    (9)
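The two homogeneous cases can be checked numerically. The sketch below (illustrative names; not the paper's code) assigns deadlines by Equations 4 and 7, returning None when neither case's precondition holds:

```python
def homogeneous_deadlines(segments, T):
    """Segment deadlines for Case 1 (all light) and Case 2 (all heavy).
    segments: list of (e_j, m_j) pairs of tau_i^syn; T: period/deadline.
    Returns the list of deadlines d_j, or None in the mixed case.
    """
    P = sum(e for e, m in segments)          # critical path length
    C = sum(e * m for e, m in segments)      # total work
    theta = C / (2 * (T - P / 2))            # heavy/light threshold
    if all(m <= theta for e, m in segments):     # Case 1: H empty
        return [T * e / P for e, m in segments]
    if all(m > theta for e, m in segments):      # Case 2: L empty
        return [T * m * e / C for e, m in segments]
    return None                                  # Case 3 handled separately
```

In both cases the deadlines sum to exactly T, and the thread densities e_j/d_j stay within the bounds of Equations 5 and 8.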

Algorithm 1: Decomposition Algorithm
Input: a DAG task τ_i with period and deadline T_i, total execution requirement C_i, critical path length P_i;
Output: node deadline D_i^j and release offset Φ_i^j for each node W_i^j of τ_i;

for each node W_i^j of τ_i do
    Φ_i^j ← 0; D_i^j ← 0;
end
Represent τ_i as τ_i^syn;
θ_i ← C_i/(2(T_i − P_i/2));      /* heavy-or-light threshold */
total_heavy ← 0;                 /* number of heavy segments */
total_light ← 0;                 /* number of light segments */
C_i^heavy ← 0;                   /* total work of heavy segments */
P_i^light ← 0;                   /* light segments' critical path length */
for each j-th segment in τ_i^syn do
    if m_i^j > θ_i then          /* it is a heavy segment */
        total_heavy ← total_heavy + 1;
        C_i^heavy ← C_i^heavy + m_i^j · e_i^j;
    else                         /* it is a light segment */
        total_light ← total_light + 1;
        P_i^light ← P_i^light + e_i^j;
    end
end
if total_heavy = 0 then          /* all segments are light */
    for each j-th segment in τ_i^syn do d_i^j ← (T_i/P_i) · e_i^j;
else if total_light = 0 then     /* all segments are heavy */
    for each j-th segment in τ_i^syn do d_i^j ← (T_i/C_i) · m_i^j · e_i^j;
else                             /* τ_i^syn has both heavy and light segments */
    for each j-th segment in τ_i^syn do
        if m_i^j > θ_i then      /* heavy segment */
            d_i^j ← ((T_i − P_i/2)/C_i^heavy) · m_i^j · e_i^j;
        else                     /* light segment */
            d_i^j ← ((P_i/2)/P_i^light) · e_i^j;
        end
    end
end
/* Remove segment deadlines; assign node deadlines */
for each node W_i^j of τ_i in breadth-first search order do
    if W_i^j belongs to segments k to r in τ_i^syn then
        D_i^j ← d_i^k + d_i^{k+1} + ... + d_i^r;                    /* node deadline */
        Φ_i^j ← max{Φ_i^l + D_i^l | W_i^l is a parent of W_i^j};    /* offset */
    end
end

Case 3: H_i ≠ ∅ and L_i ≠ ∅. The task has both heavy segments and light segments. A total of T_i − P_i/2 time units is assigned to the heavy segments, and the remaining P_i/2 time units to the light segments. The T_i − P_i/2 time units are split proportionally among the heavy segments according to the work of each segment.
The total work (execution requirement) of the heavy segments of τ_i^syn, denoted C_i^heavy, is defined as

  C_i^heavy = Σ_{j∈H_i} m_i^j · e_i^j

For each heavy segment j, the deadline d_i^j is calculated as

  d_i^j = ((T_i − P_i/2)/C_i^heavy) · m_i^j · e_i^j    (10)

Since m_i^j > C_i/(2(T_i − P_i/2)) for each heavy segment j, we have

  d_i^j = ((T_i − P_i/2) · m_i^j · e_i^j)/C_i^heavy > ((T_i − P_i/2)/C_i^heavy) · (C_i/(2(T_i − P_i/2))) · e_i^j = (C_i/(2 C_i^heavy)) · e_i^j ≥ e_i^j/2    (11)

Hence, the maximum density of a thread in any heavy segment is at most 2. The total density of a heavy segment becomes

  (m_i^j · e_i^j)/d_i^j = C_i^heavy/(T_i − P_i/2) ≤ C_i/(T_i − T_i/2) = 2C_i/T_i    (12)

Now, to distribute time among the light segments, P_i/2 time units are split proportionally among the light segments according to the length of each thread. The critical path length of the light segments, denoted P_i^light, is defined as follows:

  P_i^light = Σ_{j∈L_i} e_i^j

For each light segment j, its deadline d_i^j is calculated as

  d_i^j = ((P_i/2)/P_i^light) · e_i^j    (13)

The density of a thread in any light segment is at most 2, since

  d_i^j = (P_i/(2 P_i^light)) · e_i^j ≥ (P_i/(2 P_i)) · e_i^j = e_i^j/2    (14)

Since a light segment has at most C_i/(2(T_i − P_i/2)) threads, the total density of a light segment is at most

  2 · C_i/(2(T_i − P_i/2)) = C_i/(T_i − P_i/2) ≤ C_i/(T_i − T_i/2) = 2C_i/T_i    (15)

2) Calculating Deadline and Offset for Nodes: We have assigned segment deadlines to (the threads of) each segment of τ_i^syn in Step 1 (Equations 4, 7, 10, 13). Since a node may be split into multiple (consecutive) segments in τ_i^syn, we now have to remove all segment deadlines of a node to reconstruct (restore) the node. Namely, we add all segment deadlines of a node, and assign the total as the node's deadline. Let a node W_i^j of τ_i belong to segments k through r (1 ≤ k ≤ r ≤ s_i) in τ_i^syn. The deadline D_i^j of node W_i^j is then calculated as follows:

  D_i^j = d_i^k + d_i^{k+1} + ... + d_i^r    (16)

Note that the execution requirement E_i^j of node W_i^j is

  E_i^j = e_i^k + e_i^{k+1} + ... + e_i^r    (17)

Node W_i^j cannot start until all of its parents complete. Hence, its release offset Φ_i^j is determined as follows:

  Φ_i^j = 0,  if W_i^j has no parent;
  Φ_i^j = max{Φ_i^l + D_i^l | W_i^l is a parent of W_i^j},  otherwise.

Now that we have assigned an appropriate deadline D_i^j and release offset Φ_i^j to each node W_i^j of τ_i, the DAG τ_i is decomposed into its nodes.
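Putting the segment deadlines and the node deadline/offset step together, the whole decomposition can be sketched as follows (a minimal Python sketch of Algorithm 1 with illustrative names; it assumes the segment list and the node-to-segment mapping have already been computed):

```python
def decompose(segments, node_segs, node_parents, T):
    """Sketch of Algorithm 1.
    segments:     list of (e_j, m_j) pairs of tau_i^syn
    node_segs:    dict node -> (k, r), the node's inclusive segment range,
                  with nodes listed in topological (BFS) order
    node_parents: dict node -> list of parent nodes
    T:            period (= deadline) of the DAG
    Returns (D, Phi): per-node deadlines and release offsets.
    """
    P = sum(e for e, m in segments)
    C = sum(e * m for e, m in segments)
    assert T >= P, "task must be schedulable in isolation"
    theta = C / (2 * (T - P / 2))              # heavy/light threshold
    heavy = [m > theta for e, m in segments]
    C_heavy = sum(e * m for (e, m), h in zip(segments, heavy) if h)
    P_light = sum(e for (e, m), h in zip(segments, heavy) if not h)

    d = []                                     # segment deadlines
    for (e, m), h in zip(segments, heavy):
        if C_heavy == 0:                       # all segments light (Eq. 4)
            d.append(T * e / P)
        elif P_light == 0:                     # all segments heavy (Eq. 7)
            d.append(T * m * e / C)
        elif h:                                # mixed: heavy segment (Eq. 10)
            d.append((T - P / 2) * m * e / C_heavy)
        else:                                  # mixed: light segment (Eq. 13)
            d.append((P / 2) * e / P_light)

    D, Phi = {}, {}
    for v, (k, r) in node_segs.items():
        D[v] = sum(d[k:r + 1])                 # Eq. 16
        Phi[v] = max((Phi[p] + D[p] for p in node_parents.get(v, [])),
                     default=0)                # release offset
    return D, Phi
```

On any chain of the DAG, the offsets and deadlines nest so that the last node finishes by T, and every node's density E/D stays within the bound of Equation 19.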
Each node W_i^j is now an individual (sequential) multiprocessor subtask with an execution requirement E_i^j, a constrained deadline D_i^j, and a release offset Φ_i^j. Note that the period of W_i^j is still T_i, the same as that of the original DAG. The release offset Φ_i^j ensures that node W_i^j starts execution no earlier than Φ_i^j time units after the release time of the original DAG. Our method guarantees that for a general DAG no node is split into smaller subtasks, which ensures node-level non-preemption. Thus, the (node-level) non-preemptive behavior of the original task is preserved when scheduling the nodes as individual tasks: nodes of the DAG are never preempted. The entire decomposition method is presented as Algorithm 1, which runs in linear time in the DAG size (i.e., the number of nodes and edges). Figure 3 shows the complete decomposition of τ_i.

C. Density Analysis after Decomposition

After decomposition, let τ_i^dec denote the set of all subtasks (i.e., nodes) that τ_i generates. Note that the densities of all such subtasks comprise the density of τ_i^dec. We now analyze the density of τ_i^dec, which will later be used to analyze schedulability. Let node W_i^j of τ_i belong to segments k through r (1 ≤ k ≤ r ≤ s_i) in τ_i^syn. Since W_i^j has been assigned deadline D_i^j, by Equations 16 and 17 its density δ_i^j after decomposition is

  δ_i^j = E_i^j/D_i^j = (e_i^k + e_i^{k+1} + ... + e_i^r)/(d_i^k + d_i^{k+1} + ... + d_i^r)    (18)

By Equations 5, 8, 11, and 14, d_i^k ≥ e_i^k/2 for all i, k. Hence, from (18),

  δ_i^j = E_i^j/D_i^j ≤ (e_i^k + e_i^{k+1} + ... + e_i^r)/((e_i^k + e_i^{k+1} + ... + e_i^r)/2) = 2    (19)

Let τ^dec be the set of all generated subtasks of all original DAG tasks, and let δ_max be the maximum density among all subtasks in τ^dec. By Equation 19,

  δ_max = max{δ_i^j | 1 ≤ j ≤ n_i, 1 ≤ i ≤ n} ≤ 2    (20)

Theorem 1. Let a DAG τ_i, 1 ≤ i ≤ n, with period T_i, critical path length P_i, and total execution requirement C_i be decomposed into subtasks (nodes), denoted τ_i^dec, using Algorithm 1. The density of τ_i^dec is at most 2C_i/T_i.

Proof: Since we decompose τ_i into nodes, the densities of all decomposed nodes W_i^j, 1 ≤ j ≤ n_i, comprise the density of τ_i^dec. In Step 1, every node W_i^j of τ_i is split into threads in different segments of τ_i^syn, and each segment is assigned a segment deadline. In Step 2, we remove all segment deadlines in the node, and their total is assigned as the node's deadline. If τ_i were scheduled in the form of τ_i^syn, then each segment would be scheduled only after its preceding segment completes. That is, at any time at most one segment is active.
Since a segment has density at most 2C_i/T_i (Equations 6, 9, 12, 15), the overall density of τ_i^syn never exceeds 2C_i/T_i. Hence, it is sufficient to prove that removing the segment deadlines in the nodes does not increase the task's overall density. That is, it is sufficient to prove that the density δ_i^j (Equation 18) of any node W_i^j after removing its segment deadlines is no greater than the density δ_i^{j,syn} that it had before removing its segment deadlines. Let node W_i^j of τ_i be split into threads in segments k through r (1 ≤ k ≤ r ≤ s_i) in τ_i^syn. Since the total density of any set of tasks is an upper bound on its load (proven in [27]), the load of the threads of W_i^j is no greater than the total density of these threads. Since each of these threads executes only once in an interval of length D_i^j, by Equation 2 the demand bound function of the thread, thread_l, in segment l, k ≤ l ≤ r, in the interval D_i^j is

  DBF(thread_l, D_i^j) = e_i^l

Therefore, using Equation 3, the load, denoted λ_i^{j,syn}, of the threads of W_i^j in τ_i^syn for the interval D_i^j is

  λ_i^{j,syn} ≥ e_i^k/D_i^j + e_i^{k+1}/D_i^j + ... + e_i^r/D_i^j = E_i^j/D_i^j = δ_i^j

Since δ_i^{j,syn} ≥ λ_i^{j,syn} for any W_i^j, we have δ_i^j ≤ δ_i^{j,syn}.

Let δ_sum be the total density of all subtasks in τ^dec. Then, from Theorem 1,

  δ_sum = Σ_{i=1}^n δ(τ_i^dec) ≤ Σ_{i=1}^n 2C_i/T_i    (21)

V. PREEMPTIVE EDF SCHEDULING

Once all DAG tasks are decomposed into nodes (i.e., subtasks), we consider scheduling the nodes. Since every node after decomposition becomes a sequential task, we can schedule them using traditional multiprocessor scheduling policies. In this section, we consider the preemptive global EDF policy.

Lemma 2. For any set of parallel DAG tasks τ = {τ_1, ..., τ_n}, let τ^dec be the decomposed task set. If τ^dec is schedulable under some preemptive scheduling, then τ is preemptively schedulable.

Proof: In each τ_i^dec, a node is released only after all of its parents finish execution. Hence, the precedence relations of the original task τ_i are retained in τ_i^dec. Besides, for each τ_i^dec, the deadline and the total execution requirement are the same as those of the original task τ_i.
Hence, if τ^dec is preemptively schedulable, a preemptive schedule must exist for τ in which each task in τ meets its deadline.

To schedule the decomposed subtasks τ^dec, the EDF policy is the same as traditional global EDF: jobs with earlier absolute deadlines have higher priorities. Under the preemptive policy, a job can be suspended (preempted) at any time by arriving higher-priority jobs, and is later resumed with (in theory) no cost or penalty. Under preemptive global EDF, we now present a schedulability analysis for τ^dec in terms of a resource augmentation bound which, by Lemma 2, is also a sufficient analysis for the original DAG task set τ. For a task set, a resource augmentation bound ν of a scheduling policy A on a multi-core processor with m cores is a processor speed-up factor. That is, if there exists any way to schedule the task set on m identical unit-speed processor cores, then A is guaranteed to successfully schedule it on an m-core processor in which each core is ν times as fast as the original. Our analysis hinges on a result (Theorem 3) for preemptive global EDF scheduling of constrained deadline sporadic tasks on a traditional multiprocessor platform [28]. This result is a generalization of the result for implicit deadline tasks [29].

[Fig. 3. Decomposition of τ_i (shown in Figure 1): (a) calculating segment deadlines of τ_i^syn; (b) removing segment deadlines, and calculating node deadlines and offsets.]

Theorem 3. (From [28]) Any constrained deadline sporadic sequential task set π with total density δ_sum(π) and maximum density δ_max(π) is schedulable using the preemptive global EDF policy on m unit-speed processor cores if

  δ_sum(π) ≤ m − (m − 1) · δ_max(π)

Note that τ^dec consists of constrained deadline (sub)tasks that are periodic with offsets. If they had no offsets, the above condition would apply directly. Taking the offsets into account, the execution requirement, the deadline, and the period (which equals the period of the original DAG) of each subtask remain unchanged. The release offsets only ensure that certain subtasks of the same original DAG never execute simultaneously, which preserves the precedence relations of the DAG. This implies that both δ_sum and δ_max of the subtasks with offsets are no greater than δ_sum and δ_max, respectively, of the same set of tasks with no offsets. Hence, Theorem 3 holds for τ^dec. We now use the results of the density analysis from Subsection IV-C, and prove that τ^dec is guaranteed to be schedulable with a resource augmentation of at most 4 (Theorem 4).

Theorem 4. For any set of DAG model parallel tasks τ = {τ_1, τ_2, ..., τ_n}, let τ^dec be the decomposed task set. If there exists any algorithm that can schedule τ on m unit-speed processor cores, then τ^dec is schedulable under preemptive global EDF on m processor cores, each of speed 4.

Proof: If τ is schedulable on m identical unit-speed processor cores, the following condition must hold:

  Σ_{i=1}^n C_i/T_i ≤ m    (22)

To be able to schedule the decomposed tasks τ^dec, let each processor core be of speed ν. On an m-core platform where each core has speed ν, let the total density and the maximum density of task set τ^dec be denoted by δ_sum,ν and δ_max,ν, respectively.
From Equation 20, we have

  δ_max,ν = δ_max/ν ≤ 2/ν    (23)

Based on Equation 22, when each processor core is of speed ν, the total density of τ^dec given in Equation 21 becomes

  δ_sum,ν = δ_sum/ν ≤ (1/ν) · Σ_{i=1}^n 2C_i/T_i ≤ 2m/ν    (24)

Using Equations 23 and 24 in Theorem 3, τ^dec is schedulable under preemptive EDF on m cores, each of speed ν, if

  2m/ν ≤ m − (m − 1) · 2/ν  ⟺  2m + 2(m − 1) ≤ mν  ⟺  ν ≥ 4 − 2/m

From the above condition, τ^dec must be schedulable if ν ≥ 4 − 2/m. Since 4 − 2/m < 4, speed ν = 4 is sufficient.
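As a concrete check of this derivation, the sketch below (hypothetical helper name) evaluates Theorem 3's condition for the decomposed set on ν-speed cores, using the bounds of Equations 23 and 24:

```python
def edf_schedulable_on_speedup(utilizations, m, nu):
    """Sufficient preemptive-EDF test (Theorem 3) applied to the
    decomposed set on m cores of speed nu, using the density bounds
    delta_sum,nu <= 2*sum(C_i/T_i)/nu (Eq. 24) and
    delta_max,nu <= 2/nu (Eq. 23).

    utilizations: list of C_i/T_i values of the original DAG tasks.
    """
    delta_sum_nu = 2 * sum(utilizations) / nu
    delta_max_nu = 2 / nu
    return delta_sum_nu <= m - (m - 1) * delta_max_nu
```

For a fully loaded system (sum of utilizations equal to m), the smallest speed for which this test passes is 4 − 2/m, matching the bound in the proof.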

VI. NON-PREEMPTIVE EDF SCHEDULING

We now address non-preemptive global EDF scheduling, considering that the original task set τ is scheduled with node-level non-preemption. In node-level non-preemptive scheduling, once the execution of a node of a DAG starts, it cannot be preempted by any task. Most parallel languages and libraries have yield points at the ends of threads (nodes of the DAG), where they allow low-cost, user-space preemption. For these languages and libraries, schedulers that switch context only when threads end (in other words, where threads do not preempt each other) can be implemented entirely in user space (without interaction with the kernel), and therefore have low overheads. The decomposition converts each node of a DAG into a traditional multiprocessor (sub)task. Therefore, we consider fully non-preemptive global EDF scheduling of the decomposed tasks. Namely, once a job of a decomposed (sub)task starts execution, it cannot be preempted by any other job.

Lemma 5. For a set of parallel DAG tasks τ = {τ_1, ..., τ_n}, let τ^dec be the decomposed task set. If τ^dec is schedulable under some fully non-preemptive scheduling, then τ is schedulable under node-level non-preemption.

Proof: Since the decomposition converts each node of a DAG into an individual task, a fully non-preemptive schedule of τ^dec preserves the node-level non-preemptive behavior of task set τ. The rest of the proof follows from Lemma 2.

Under non-preemptive global EDF, we now present a schedulability analysis for τ^dec in terms of a resource augmentation bound which, by Lemma 5, is also a sufficient analysis for the DAG task set τ. The analysis exploits Theorem 6 for non-preemptive global EDF scheduling of constrained deadline periodic tasks on a traditional multiprocessor. The theorem is a generalization of the result for implicit deadline tasks [30]. For a task set π, let C_max(π) and D_min(π) be the maximum execution requirement and the minimum deadline, respectively, among all tasks in π.
In non-preemptive scheduling, C_max(π) represents the maximum blocking time that a task may experience, and plays a major role in schedulability. Hence, the non-preemption overhead, defined in [30], of the task set π is given by ρ(π) = C_max(π)/D_min(π). The value of ρ(π) indicates the added penalty or overhead associated with non-preemptivity. In other words, since preemption is not allowed, the capacity of each processor is reduced (at most) by a factor of ρ(π). In non-preemptive scheduling, this capacity reduction is compensated by the reduced cost of context switches, state saving, etc.

Theorem 6. (From [30]) Any constrained deadline periodic task set π with total density δ_sum(π), maximum density δ_max(π), and non-preemption overhead ρ(π) is schedulable using non-preemptive global EDF on m unit-speed cores if

  δ_sum(π) ≤ m(1 − ρ(π)) − (m − 1) · δ_max(π)

Let E_max and E_min be the maximum and minimum execution requirements, respectively, among all nodes of all DAG tasks. In node-level non-preemptive scheduling of the DAG tasks, the processor capacity reduction due to non-preemptivity is at most E_max/E_min. Hence, this value is the non-preemption overhead of the DAG tasks, and is denoted by ρ:

  ρ = E_max/E_min    (25)

Theorem 7 derives a resource augmentation bound of 4 + 2ρ for non-preemptive global EDF scheduling after decomposition.

Theorem 7. For DAG model parallel tasks τ = {τ_1, ..., τ_n}, let τ^dec be the decomposed task set with non-preemption overhead ρ. If there exists any way to schedule τ on m unit-speed processor cores, then τ^dec is schedulable under non-preemptive global EDF on m cores, each of speed 4 + 2ρ.

Proof: After decomposition, let D_min be the minimum deadline among all subtasks in τ^dec. Since E_max (i.e., the maximum execution requirement among all subtasks in τ^dec) represents the maximum blocking time that a subtask may experience, the non-preemption overhead of the decomposed tasks is E_max/D_min.
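Theorem 6's condition, applied to the decomposed set on ν-speed cores, can be sketched as follows (a minimal sketch with illustrative names; the 2ρ/ν overhead bound used here follows from D_min ≥ E_min/2, a consequence of Equation 19):

```python
def np_edf_schedulable_on_speedup(utilizations, rho, m, nu):
    """Sufficient non-preemptive-EDF test (Theorem 6) for the
    decomposed set on m cores of speed nu. Bounds used:
      delta_sum,nu <= 2*sum(C_i/T_i)/nu,  delta_max,nu <= 2/nu,
      overhead on nu-speed cores <= 2*rho/nu, rho = E_max/E_min
    (every node's deadline is at least half its execution requirement).
    """
    delta_sum_nu = 2 * sum(utilizations) / nu
    delta_max_nu = 2 / nu
    rho_nu = 2 * rho / nu
    return delta_sum_nu <= m * (1 - rho_nu) - (m - 1) * delta_max_nu
```

For a fully loaded system, this test passes exactly when ν ≥ 4 + 2ρ − 2/m, matching the augmentation bound derived below in the proof of Theorem 7.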
From Equations 19 and 25, since each subtask's deadline is at least half of its execution requirement, the non-preemption overhead of the decomposed tasks is

  E_max/D_min ≤ E_max/(E_min/2) = 2 · E_max/E_min = 2ρ    (26)

Similar to Theorem 4, suppose we need each processor core to be of speed ν to be able to schedule the decomposed tasks τ^dec. From Equation 26, the non-preemption overhead of τ^dec on ν-speed processor cores is

  (E_max/ν)/D_min ≤ 2ρ/ν    (27)

Now, considering a non-preemption overhead of at most 2ρ/ν on ν-speed processor cores, and using Equations 23 and 24 in Theorem 6, τ^dec is schedulable under non-preemptive EDF on m cores, each of speed ν, if

  2m/ν ≤ m(1 − 2ρ/ν) − (m − 1) · 2/ν  ⟺  2m + 2ρm + 2(m − 1) ≤ mν  ⟺  ν ≥ 4 + 2ρ − 2/m

From the above condition, task set τ^dec must be schedulable if ν ≥ 4 + 2ρ − 2/m. Since 4 + 2ρ − 2/m < 4 + 2ρ, speed ν = 4 + 2ρ is sufficient.

VII. EVALUATION

The derived resource augmentation bounds provide a sufficient condition for schedulability. Namely, if a set of DAG tasks is schedulable on a unit-speed m-core machine by a (potentially unrealizable) ideal scheduler, then the tasks, upon our proposed decomposition, are guaranteed to be schedulable under global EDF on an m-core machine where each core has a speed of 4 (with preemption) or 4 + 2ρ (without preemption). In this section, we evaluate our scheduler using simulations. We want to accomplish two things. First, we want to validate that our theoretical bounds are correct; that is, an augmentation of 4 for preemptive EDF (or 4 + 2ρ for non-preemptive EDF) is sufficient to schedule any task set that an ideal scheduler can schedule. Second, we want to see how effective our scheduling strategy is in practice, and whether the bounds are an accurate representation of how much augmentation is needed in practice. We do not compare with any baseline since no other strategies for real-time scheduling of general DAGs exist.

A. Task and Task Set Generation

We want to evaluate our scheduler using task sets that an optimal scheduler could schedule on unit-speed processors. However, as we cannot determine this ideal scheduler, we assume that an ideal scheduler can schedule any task set whose total utilization is no greater than m and in which each individual task is schedulable in isolation (that is, its critical path length is no greater than its deadline). Therefore, in our experiments, for each value of m (i.e., the number of processor cores), we generate task sets whose utilization is exactly m, fully loading a machine of unit-speed processors. We use the Erdős–Rényi method G(n, p) [31] to generate task sets for evaluation. The precise methodology is as follows.

Number of nodes. To generate a DAG τ_i, we pick the number of nodes uniformly at random in the range [50, 350]. We found that these values allow us to generate varied task sets within a reasonable amount of time.

Adding edges. We add edges to the graph using the Erdős–Rényi method G(n, p) [31]. We scan all possible edges directed from a lower node id to a higher node id, to avoid introducing a cycle into the graph. For each possible edge, we generate a random value in the range [0, 1] and add the edge only if the generated value is less than a predefined probability p. (We will vary p in our experiments to explore its effect.) Finally, we add a minimum number of additional edges so that each node (except the first and the last) has at least one incoming and one outgoing edge, in order to make the DAG weakly connected. Note that the critical path length of a DAG generated using the pure Erdős–Rényi method increases as p increases. However, since our method is slightly modified, the critical path is also long when p is small. Therefore, as p increases, the critical path length first decreases up to a certain value of p and then increases again.
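The modified Erdős–Rényi generation described above can be sketched as follows (a sketch with illustrative names and parameter defaults; the execution-time range shown corresponds to one of the ρ settings discussed later):

```python
import random

def gen_dag(p, n_min=50, n_max=350, seed=None):
    """Generate a DAG by the modified Erdos-Renyi G(n, p) method.
    Returns (exec_time, parents) with node ids 0..n-1 ordered so that
    every edge goes from a lower id to a higher id (hence acyclic).
    """
    rng = random.Random(seed)
    n = rng.randint(n_min, n_max)
    parents = {v: [] for v in range(n)}
    children = {v: [] for v in range(n)}
    # Scan only low -> high pairs, which rules out cycles.
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                parents[v].append(u)
                children[u].append(v)
    # Make the DAG weakly connected: every interior node gets at least
    # one incoming and one outgoing edge.
    for v in range(1, n):
        if not parents[v]:
            parents[v].append(0)
            children[0].append(v)
    for v in range(n - 1):
        if not children[v]:
            children[v].append(n - 1)
            parents[n - 1].append(v)
    # Node execution times, e.g. uniform in [50, 500] for rho = 10.
    exec_time = {v: rng.randint(50, 500) for v in range(n)}
    return exec_time, parents
```

The fix-up loops are what make the critical path long even for small p: with few random edges, almost every node is chained through the first and last nodes.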
Execution times of nodes. We assign every node an execution time chosen randomly from a specified range. The range is based on the value and type (continuous or discrete) of the non-preemption overhead ρ (explained in the next subsection). At this point, we have the DAG structure and the execution times of its nodes. For each DAG τ_i, we now assign a period T_i (which is also its deadline) that is no less than its critical path length P_i. We consider two types of task sets:

Task sets with harmonic periods. These deadlines are carefully picked to be multiples of each other, so as to ensure that we can run our experiments up to the hyper-period of the task sets. In particular, we pick deadlines that are powers of two. We find the smallest value a such that P_i ≤ 2^a, and randomly set T_i to be one of 2^a, 2^{a+1}, or 2^{a+2}. These choices of period are due to the fact that we want some high-utilization tasks and some low-utilization tasks. The ratio P_i/T_i of the task is in the range (1/2, 1], (1/4, 1/2], or (1/8, 1/4] when its period T_i is 2^a, 2^{a+1}, or 2^{a+2}, respectively.

Task sets with arbitrary periods. We first generate a random number Gamma(2, 1) using the gamma distribution [32]. Then we set the period T_i to be (P_i + C_i/(0.5m)) · (1 + 0.25 · Gamma(2, 1)). We choose this formula for three reasons. First, we want to ensure that the assigned value is a valid period, i.e., P_i ≤ T_i. Second, we want to ensure that each task set contains a reasonable number of tasks even when the number of cores is small; at the same time, with more cores, we do not want to limit the average DAG utilization to a certain small value. Hence, the minimum period is a function of m. Third, while we want the average period to be close to the minimum valid period (to have high-utilization tasks), we also want some tasks with large periods. Table I shows the average number of DAGs per task set achieved by the random period generation process.

TABLE I
NUMBER OF TASKS PER TASK SET

  m \ p   0.01  0.05  0.1  0.2  0.4  0.6  0.8
  4       4     4     4    5    6    7    8
  8       4     4     5    7    9    11   13
  16      4     6     7    10   15   19   23
  32      5     8     11   17   26   34   41

To create a task set, we combine individual DAGs as follows.
We add DAGs to the task set until the total utilization of the set exceeds m, and then remove the last generated DAG; at this point, the total utilization is smaller than m. To bring the total utilization close to m, we add small DAGs with long periods (and therefore small utilization), and stop adding them once the total utilization is larger than 99% of m.

B. Experimental Methodology

We run experiments varying the following four parameters.

Harmonic vs. arbitrary periods. We want to evaluate whether arbitrary periods are better or worse than harmonic ones. For harmonic-period task sets, we run the simulation up to their hyper-period. For arbitrary-period task sets, the hyper-period can be too long to simulate, and hence we run the simulation up to 20 times the maximum period.

Number of cores (m). We want to evaluate whether parallel scheduling becomes easier or harder as the number of cores increases. We run experiments with m = 4, 8, 16, and 32.

Probability of an edge (p). As stated before, p affects the critical path length, the density, and the structure of the DAG. We test 14 values of p: 0.01, 0.02, 0.03, 0.05, 0.07, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9.

Non-preemption overhead (ρ). This is the ratio of the maximum node execution requirement to the minimum node execution requirement. For non-preemptive EDF scheduling, the resource augmentation bound increases as ρ increases; we want to evaluate whether the effect of increased ρ is really that severe in practice. For all of our experiments, we set the minimum node execution requirement to 50 and vary the maximum. To get ρ = 1, 2, 5, and 10, the maximum execution requirements are chosen to be 50, 100, 250, and 500, respectively. In addition, when we evaluate the performance of non-preemptive EDF, we want to maximize the influence of ρ. Therefore, besides using uniformly generated