Energy-Efficient Primary/Backup Scheduling Techniques for Heterogeneous Multicore Systems


Abhishek Roy, Hakan Aydin
Department of Computer Science, George Mason University, Fairfax, Virginia

Dakai Zhu
Department of Computer Science, University of Texas at San Antonio, San Antonio, TX

Abstract: In this paper, we consider energy-efficient and fault-tolerant scheduling of real-time tasks on heterogeneous multicore systems. Each task consists of a main copy and a backup copy which are scheduled on different cores, for fault tolerance purposes. Our framework deliberately delays the backup tasks in order to cancel them dynamically when the main task copies complete successfully (without faults). We identify and address two dimensions of the problem, i.e., partitioning tasks and determining processor voltage/frequency levels to minimize energy consumption. Our experimental results show that the performance levels of our proposed algorithms are close to that of an ideal solution with optimal (but computationally prohibitive) partitioning and frequency assignment components.

I. INTRODUCTION

Energy management remains a crucial component for the design and implementation of embedded systems, including those deployed in safety-critical and time-critical applications such as those in industrial control, avionics, and high-confidence medical systems. Recently, heterogeneous (asymmetric) multicore systems have been embraced by the industry due to their power-efficient design and the flexibility they offer in dealing with different types of workloads. These systems typically combine high-performance big cores with little cores that consume less power, at the cost of providing more modest performance. ARM's big.LITTLE systems that include out-of-order and fast cores (such as ARM Cortex-A15) and in-order, energy-efficient cores (such as ARM Cortex-A7) are among the most well-known examples [1]. The main idea in deploying heterogeneous multicore systems is to execute the workload at hand on the core most suitable for the current performance objective (high performance or energy savings).
The research community has recently addressed several aspects of heterogeneous multicore systems with a multi-dimensional effort [2], [3]. Another increasingly important dimension, in particular for safety-critical embedded systems, is reliability. Those systems should be able to detect, and recover from, various types of faults in a timely manner [4]. The majority of run-time faults are categorized as transient, in that they are short-lived; they are typically induced by phenomena such as electromagnetic interference and cosmic rays. However, they result in erroneous task computation, and typically a recovery task, in the form of an alternative task or a re-execution, is invoked [4], [5], [6]. It has been reported that the occurrence rate of transient faults is increasing, in particular due to the use of aggressive power management techniques such as near-threshold voltage operation [7]. On the other hand, in the case of permanent faults, a processing core becomes unavailable; this is typically due to aging effects, harsh environmental conditions, and manufacturing defects. Tolerating permanent faults requires the deployment of additional hardware (such as another core) that can take over the execution of the tasks originally allocated to the affected unit [4].

In this paper, we propose implementing a fault-tolerant framework on heterogeneous dual-core systems while keeping the energy consumption at a minimum level. Specifically, we consider a set of real-time tasks where each task consists of a primary (main) copy and a backup copy, which are allocated to different cores. This allows the system to tolerate the permanent fault of any single core, since each processor has exactly one copy of each task (primary or backup) [4]. Moreover, the transient faults detected in any primary task can be recovered from by the execution of the respective backup task. Our work differs from the existing so-called standby-sparing frameworks [8], [9], in that: i) we allow scheduling a mix of primary and backup tasks on each processor, and, ii) we consider heterogeneous multicore systems.
Although in [10] we investigated a similar problem in the context of heterogeneous dual-core systems, the focus was again the standby-sparing configuration, and the mixing of primary and backup copies on a given core was not considered. To keep the energy consumption under control, the backup tasks are delayed as much as possible on their corresponding processors, because a backup can be canceled as soon as the corresponding primary completes successfully (i.e., without a fault). This also gives a chance to apply Dynamic Voltage and Frequency Scaling (DVFS) with maximum efficiency during the execution of the primary tasks on each core. We develop and propose schemes, i) to partition all primary and backup tasks, and, ii) to assign frequency (speed) to all the primary tasks to minimize the energy consumption, while meeting timing and fault tolerance constraints. Our experimental results suggest that the list-scheduling based partitioning techniques, coupled with a speed assignment approach that dynamically avoids the overlaps with the backups, exhibit superior performance which is close to the theoretical lower bound in terms of energy consumption. Our framework directly incorporates a salient feature of heterogeneous cores, namely the fact that the energy consumption and execution time figures of different tasks scale by different ratios when executed on different cores [11].

/17/$31.00 © 2017 IEEE

II. SYSTEM MODEL AND ASSUMPTIONS

A. Platform and Application model

We consider a heterogeneous dual-core system with a high-performance (big) core and a low-power (little) core. The high-performance and low-power cores are denoted by HP and LP, respectively, throughout the paper. The cores are assumed to have the same instruction set architecture, implying that the executable of a task can run on either core. The workload consists of $n$ independent real-time tasks $\{\tau_1, \ldots, \tau_n\}$ that will be executed on this dual-core platform. We assume the frame-based execution model [6], [12] in which all tasks have the same period, which is equal to the common deadline $D$. Each processing core is equipped with the Dynamic Voltage and Frequency Scaling (DVFS) feature that allows changing the frequency (processing speed) at runtime. Moreover, the Dynamic Power Management (DPM) feature allows a given core to switch to a low-power (idle) mode when it is not actively executing tasks. A task $\tau_i$ that requires $C_i$ cycles on a given core may take up to $W_i = C_i / f$ units of execution time on that core, if executed at the frequency level $f$. Due to the architectural differences, a task's required number of cycles, and hence its execution time, can be different on the HP and LP cores. Therefore, we use superscripts HP and LP to denote the variables on the HP or the LP core ($C_i^{LP}$, $W_i^{LP}$, $C_i^{HP}$, $W_i^{HP}$). We define the nominal utilization of a task $\tau_i$ as $C_i^{HP} / D$. The maximum frequency levels supported by the HP and LP cores are denoted by $f_{max}^{HP}$ and $f_{max}^{LP}$, respectively. We assume $f_{max}^{HP} = 1.0$, and normalize all other frequency values with respect to that value. Associated with each (primary) task $\tau_i$, there is a backup task $B_i$ with the exact same timing parameters as those of $\tau_i$.
$\tau_i$ and $B_i$ are allocated to different cores: should a permanent fault affect any of the processing cores, the alternative core can take over and finish the workload before the deadline. When a primary copy completes, the acceptance (or sanity) tests [4] are performed to check for errors induced by transient faults. If a fault is not detected, the corresponding backup copy (or its remaining part) on the other core is canceled. Otherwise, the backup copy runs to completion.

B. Power Model

The power consumption characteristics of the HP and LP cores differ by design. For any processing core, the dynamic power consumption of an executing task $\tau_i$ is modeled as $P_i(f) = a_i f^3 + \alpha_i$, where $a_i$ denotes the switching capacitance, $\alpha_i$ denotes the frequency-independent power consumption, and $f$ is the processing frequency of the task. Due to the asymmetry of the cores, these parameters are different for each core and again we use superscripts HP and LP to denote core-specific power parameters (e.g., $P_i^{HP}$, $\alpha_i^{HP}$). Each core executes tasks in the active state, dissipating power as determined by the characteristics of the current task and processing frequency. When a core does not execute tasks, it remains in the low-power (idle) state. The idle-state power consumption of the high-performance and low-power cores is denoted by $P_{idle}^{HP}$ and $P_{idle}^{LP}$, respectively. We assume those figures include the static power consumption of the corresponding core as well. The energy consumption during a time interval is given by the aggregate power consumption during the same interval. Existing research indicates that scaling down the frequency below a certain threshold is no longer effective for saving energy, due to the impact of the frequency-independent power component [12]. This threshold frequency, known as the energy-efficient frequency ($f_{ee}$), can be derived through analytical techniques [12].
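As an illustration of the power model, the following minimal Python sketch computes the energy of one task over a frame and one possible closed form for the energy-efficient frequency. It is a sketch under stated assumptions, not the derivation used in [12]: it assumes the core draws its idle power whenever it is not executing, so that only the excess $\alpha - P_{idle}$ penalizes slow execution. The names `Core`, `frame_energy` and `energy_efficient_frequency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    """Power parameters of one core (illustrative names, not from the paper)."""
    a: float       # switching capacitance in P(f) = a*f**3 + alpha
    alpha: float   # frequency-independent active power
    p_idle: float  # power drawn in the idle state


def frame_energy(core: Core, cycles: float, f: float, frame: float) -> float:
    """Energy over one frame: active for W = cycles/f, idle for the rest."""
    w = cycles / f
    if w > frame:
        raise ValueError("task does not fit in the frame at this frequency")
    return (core.a * f ** 3 + core.alpha) * w + core.p_idle * (frame - w)


def energy_efficient_frequency(core: Core) -> float:
    """Frequency minimizing frame_energy when the deadline is not binding.

    Assumption: idle time costs p_idle. Setting the derivative of
    a*f**2 + (alpha - p_idle)/f to zero gives
    f_ee = ((alpha - p_idle) / (2*a)) ** (1/3).
    """
    excess = max(core.alpha - core.p_idle, 0.0)
    return (excess / (2.0 * core.a)) ** (1.0 / 3.0)


# HP-core values quoted in the paper's examples: a = 1.0, alpha = 0.1, idle = 0.05
hp = Core(a=1.0, alpha=0.1, p_idle=0.05)
print(round(energy_efficient_frequency(hp), 2))  # 0.29
```

Running a task below this frequency stretches its execution so much that the frequency-independent term outweighs the savings in dynamic power, which is why the speed assignment policies never go below $f_{ee}$.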
Problem Statement: Given a set of real-time tasks and a heterogeneous dual-core system, minimize the energy consumption by determining
1) the allocation of tasks such that the primary and backup copy of each task are assigned to different cores, and,
2) the processing frequency (speed) assignment to individual tasks.

In the following section, we investigate these two interconnected dimensions and propose several efficient schemes.

III. PROPOSED SCHEMES

Before describing the specific algorithms that we propose, we present a number of general principles that guide our solution framework. To start with, in general, the concurrent execution of a primary task and its backup, though possible, is not desirable because it incurs the full energy cost of the backup execution (Figure 1a). However, when the backup's execution can be delayed, its remaining part can be cancelled once the primary completes successfully (Figure 1b). Throughout the paper, we show the cancelled part of the backup tasks by dashed patterns in all the figures.

Fig. 1: Concurrent Execution of Primary and Backup Tasks. (a) Execution with full primary-backup overlap. (b) Execution with partial overlap.

This further suggests that on a given core, all the primary tasks must execute before the backup tasks allocated to that core. Moreover, provisions are made to execute all backup tasks at the maximum frequency on their respective cores, should there be a need; obviously this choice minimizes their overlap with their respective primary tasks on the other core, and in addition, since faults are rare events, the full-speed execution of the backups has only a minimal impact on the average-case energy consumption. Clearly, this choice also leaves maximum slow-down opportunities for the primary tasks scheduled on that core through DVFS. Thus, we define the canonical execution order, in which on a given core all primary tasks are started as soon as possible, whereas backup tasks are delayed as much as possible subject to the deadline constraints, and executed at the maximum frequency if needed. Figure 2 shows a canonical execution on a single processing core to which three primary tasks ($\tau_1$, $\tau_2$, $\tau_3$) and two backup tasks ($B_4$ and $B_5$) are assigned. In the rest of the paper, we commit to this canonical execution order to execute primary and backup copies of tasks on all cores, once the partitioning is done.

Fig. 2: Canonical Execution Order

TABLE I: Example Task Set 1 (columns: $W^{HP}$, $W^{LP}$, $E^{HP}$, $E^{LP}$)

A related framework is the so-called energy-aware standby-sparing technique, in which one of the cores is designated for the primary tasks and the other one for the backup tasks exclusively [8], [9], [10]. In our framework, however, for maximum flexibility, we allow scheduling the primary and backup copies on both cores, when possible; for that reason, we call our framework mixed primary backup (MPB) assignment. The schemes we propose consist of task partitioning and speed (frequency) assignment phases which are described next.

A. Task partitioning

Task partitioning, in general, is an intractable problem; however, a well-known approach is based on the list-scheduling technique. We first describe two variants based on list scheduling for our task partitioning phase.

List-scheduling with Primaries (). In this algorithm, we consider the primary copies of the tasks and employ the list-scheduling algorithm to allocate them. First, the tasks are ordered according to their decreasing nominal utilizations.
Then, each primary task is placed on the processing core that has the maximum free capacity after the placement. Free capacity on a core is defined by $\left(f_{max} - \sum_{\tau_i \in \Gamma_p} \frac{C_i}{D}\right)$, where $\Gamma_p$ is the set of all primary tasks assigned to that core, augmented by the task under consideration. The $f_{max}$ and $\{C_i\}$ values are defined in the context of the core under consideration. Observe that the first few primary tasks will always go to the HP core, until its free capacity matches that of the LP core. Once the distribution of the primary tasks is complete, a backup copy for each primary task is allocated to the alternate core. Also, at each stage of the primary task allocation, the feasibility of both cores, in terms of time constraints, is checked.

We illustrate the behavior of the algorithm on an example task set given in Table I. The table gives task execution times (in ms) and energy consumption ($E^{HP}$, $E^{LP}$) on both cores (in mJ), under the respective maximum frequencies. The 4-task set is scheduled on a dual-core system with $f_{max}^{HP} = 1.0$ and $f_{max}^{LP} = 0.8$. We also assume $P_{idle}^{HP} = 0.05$ and $P_{idle}^{LP} = 0.02$, and for all tasks, $a^{HP} = 1.0$, $a^{LP} = 0.3$, $\alpha^{HP} = 0.1$ and $\alpha^{LP} = 0.03$. For demonstration, we use a simple runtime policy (called the static policy) in which each primary task is slowed down as much as possible without violating the frame deadline. The canonical execution order is adopted on each core. Figure 3b shows the task allocation under this scheme for our example task set in Table I. The first task, $\tau_1$, is allocated to the HP core, because it has the most free capacity among the two cores. $\tau_2$ is allocated to the LP core, whose free capacity is higher at that time. Similarly, tasks $\tau_3$ and $\tau_4$ are allocated to the HP core. It should be noted that, in contrast to the standby-sparing configuration shown in Figure 3a (which uses the partitioning method SlowerP, one of the best-performing schemes in [10]), the extent of primary-backup overlapped executions is much less in the list-scheduling-with-primaries solution.

List-scheduling with Backups ().
This algorithm works in the same way as list-scheduling with primaries, but this time, the backup copies of the tasks are considered while partitioning. Once the backup copies are distributed, their corresponding primary copies are allocated to the respective alternate processing cores. By its very nature, this algorithm tends to allocate a few initial primary tasks to the LP core, before their backups are allocated to the HP core thanks to the rule. Figure 3c shows the task allocation under this scheme for our example task set in Table I. This partitioning is a mirror image of the list-scheduling-with-primaries partitioning. Note that all primary-backup overlapped executions are avoided.

Fixed-Threshold Algorithm (). In this algorithm, the primary tasks are first ordered according to their decreasing nominal utilizations and processed one by one. Tasks are assigned to the LP core, as long as its load does not exceed a pre-defined threshold value. Otherwise the primary task is assigned to the HP core. After each primary task assignment, its backup copy is allocated to the counterpart core. The threshold value can assume any value between 0.0 and 1.0. For our example Task Set 1, this heuristic produces the task allocation shown in Figure 3d when the threshold value is 0.6. Tasks $\tau_1$ and $\tau_2$ are allocated to the LP core. When task $\tau_3$ is processed, the total used capacity on the LP core would exceed 60% if it were assigned to that core. Therefore, it is assigned to the HP core. Similarly, $\tau_4$ is allocated to the HP core.
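The three partitioning heuristics described above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the `Task` type, the function names, and the example cycle counts are hypothetical, the per-step feasibility check is omitted, and the load used by the threshold rule is assumed to be $\sum C_i^{LP} / (f_{max}^{LP} D)$.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Task:
    name: str
    c_hp: float  # cycles needed on the HP core
    c_lp: float  # cycles needed on the LP core


def by_nominal_utilization(tasks: Sequence[Task]) -> List[Task]:
    """Decreasing C_hp / D; D is common to all tasks, so sort on c_hp."""
    return sorted(tasks, key=lambda t: t.c_hp, reverse=True)


def list_schedule(tasks: Sequence[Task], frame: float,
                  f_max: Dict[str, float]) -> Dict[str, List[str]]:
    """Put each copy on the core with the most free capacity after placement.

    Feasibility checks on both cores are omitted in this sketch.
    """
    load = {"HP": 0.0, "LP": 0.0}
    placed: Dict[str, List[str]] = {"HP": [], "LP": []}
    for t in by_nominal_utilization(tasks):
        cycles = {"HP": t.c_hp, "LP": t.c_lp}
        free_after = {c: f_max[c] - (load[c] + cycles[c]) / frame for c in load}
        best = max(free_after, key=free_after.get)  # ties go to HP
        load[best] += cycles[best]
        placed[best].append(t.name)
    return placed


def mirror(placed: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """The other copy of every task goes to the alternate core."""
    return {"HP": list(placed["LP"]), "LP": list(placed["HP"])}


def fixed_threshold(tasks: Sequence[Task], frame: float, f_max_lp: float,
                    threshold: float = 0.6) -> Dict[str, List[str]]:
    """Primaries go to LP while its load stays within the threshold.

    Assumption: 'load' means sum(c_lp) / (f_max_lp * frame).
    """
    lp_cycles = 0.0
    primaries: Dict[str, List[str]] = {"HP": [], "LP": []}
    for t in by_nominal_utilization(tasks):
        if (lp_cycles + t.c_lp) / (f_max_lp * frame) <= threshold:
            lp_cycles += t.c_lp
            primaries["LP"].append(t.name)
        else:
            primaries["HP"].append(t.name)
    return primaries
```

With list scheduling on primaries, `list_schedule` yields the primary placement and `mirror` yields the backup placement. With list scheduling on backups, the roles are swapped: `list_schedule` places the backups, which have the same timing parameters, and `mirror` gives the primaries, which is why the two partitionings are mirror images of each other.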

Fig. 3: Task partitioning algorithms. (a) Standby-sparing [10]. (b) List-scheduling with primaries. (c) List-scheduling with backups. (d) Fixed-threshold algorithm.

Fig. 4: Static Speed Assignment. (a) Initial partitioning. (b) After speed assignment.

Fig. 5: Dynamic Policies. (a) Dynamic Backup Cancellation. (b) Dynamic Backup Cancellation with Minimum Overlap.

B. Speed assignment

Once the task partitioning phase is complete, the next step is to determine the speed (frequency) of the primary tasks on each core, while committing to the canonical execution order. Speed assignment to the primary tasks is critical not only because it directly determines the primary's energy consumption, but also because it indirectly determines that of the corresponding backup, whose overlap extent may change as a result of that assignment. Below we propose three speed assignment policies.

Static Speed Assignment (SSA). Figure 4 illustrates the basic principles of the SSA policy. The scheme reserves capacity for each allocated backup task (which runs at the maximum frequency of the core), and assigns a latest start time to each of them such that no deadlines are missed. In Figure 4a, $r^{HP}$ and $r^{LP}$ denote the latest start time for the first backup task on the HP and LP cores, respectively. Primary tasks are slowed down as much as possible, subject to the energy-efficient frequency bound ($f_{ee}$). Letting $r$ denote the start time of the first backup task on a specific core, and $\Gamma_P$ denote the set of all primary tasks on that core, the common frequency that finishes all these primary tasks before time $r$ is given by $f_U = \left(\sum_{\tau_i \in \Gamma_P} C_i\right) / r$. Then, each primary task $\tau_i$ is assigned the frequency $f_i = \max(f_{ee}, f_U)$. Figure 4b shows the extended execution times for primary tasks, derived through this principle.

Dynamic Backup Cancellation (DBC). In this scheme, as in SSA, the processing capacity is reserved for backup tasks and primary copies are slowed down as much as possible, subject to the energy-efficient frequency.
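The reserve-then-slow-down rule shared by these policies can be sketched as follows for a single core. The sketch assumes, as in the canonical execution order, that the backups on a core run back-to-back at the maximum frequency and finish exactly at the frame deadline; the function names and the numbers in the usage lines are hypothetical.

```python
from typing import Sequence


def latest_backup_start(backup_cycles: Sequence[float], f_max: float,
                        frame: float) -> float:
    """Latest start time r of the first backup on a core.

    Assumes backups run back-to-back at f_max and end at the deadline.
    """
    return frame - sum(backup_cycles) / f_max


def static_speed(primary_cycles: Sequence[float],
                 backup_cycles: Sequence[float],
                 f_max: float, f_ee: float, frame: float) -> float:
    """Common primary speed f = max(f_ee, f_U), with f_U = sum(C_i) / r."""
    total = sum(primary_cycles)
    if total == 0:
        return f_ee  # nothing to run; any speed at or above f_ee would do
    r = latest_backup_start(backup_cycles, f_max, frame)
    if r <= 0 or total / r > f_max:
        raise ValueError("primaries cannot finish before the reserved backup slots")
    return max(f_ee, total / r)


# Two backups of 16 and 8 cycles at f_max = 0.8 reserve the last 30 time units
# of a 100-unit frame, so 35 primary cycles must finish by time 70.
print(static_speed([35.0], [16.0, 8.0], f_max=0.8, f_ee=0.26, frame=100.0))
```

A dynamic policy could reuse the same computation at run time by shrinking `backup_cycles` whenever a primary completes without fault and by measuring `r` from the current time rather than from the start of the frame.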
However, the speed assignment routine is re-invoked at runtime: each time a primary task completes without fault, the reserved capacity for its backup copy is deallocated and used to further slow down the next primary tasks on that core. For example, when $\tau_3$ finishes without error, the reserved capacity for $B_3$ on the LP core is reclaimed to further slow down the next primary task there (Figure 5a). Note that this introduces some overlapped execution with the corresponding backup. In general, when task $\tau_i$ is about to run at time $t$, its speed is chosen as $f_U = \left(\sum_{\tau_i \in \Gamma} C_i\right) / (r - t)$, where $\Gamma$ is the set of unfinished primary tasks on the same core, and $r$ represents the earliest start time among the unfinished backup tasks, again on the same core. When a primary task completes without error, the earliest backup activation time on the alternate processing core is updated at runtime. The chosen speed value is subject to the energy-efficient frequency; therefore, for each task $\tau_i$, the speed is set to $f_i = \max(f_{ee}, f_U)$.

Dynamic Backup Cancellation with Minimum Overlap (DMO). This scheme works as the DBC scheme, but when setting the speed of the primary tasks at run-time, it attempts to minimize the overlapped execution with backups. As shown in Figure 5b, when DVFS is applied to a primary task at the beginning of its execution, the task is not maximally slowed down; instead, the overlapped execution with its backup is avoided by running somewhat faster than under the DBC policy. Under this policy, the speed of $\tau_i$ is chosen to be $f_i = \min(f_{max}, f_i')$ where $f_i' = \frac{C_i}{r_i - t}$, $r_i$ is the latest time the backup copy of $\tau_i$ can be activated (on the alternative core) without violating any deadlines, and $t$ is the current time. This speed is subject to the deadline constraint and the energy-efficient speed; therefore, $f_i$ is updated as $f_i = \max(f_i, f_{ee}, f_U)$. In this scheme, $f_U$ is re-computed with a dynamically updated $r$ value as in the DBC scheme.

TABLE II: Example Task Set 2 (columns: $W^{HP}$, $W^{LP}$, $E^{HP}$, $E^{LP}$)

To contrast the impact of these schemes, we use the 4-task set in Table II with $a^{HP} = 1.0$, $a^{LP} = 0.3$, $\alpha^{HP} = 0.1$ and $\alpha^{LP} = 0.03$ for each task. The task set is executed on a dual-core system with $f_{max}^{HP} = 1.0$ and $f_{max}^{LP} = 0.8$. We also assume $P_{idle}^{HP} = 0.05$ and $P_{idle}^{LP} = 0.02$.

Fig. 6: Execution under different schemes. (a) Static speed assignment (SSA). (b) Dynamic backup cancellation (DBC). (c) Dynamic cancellation with minimum overlap (DMO).

Figure 6a shows the execution of the task set under partitioning and static speed assignment. The HP core (at the top) uses the energy-efficient frequency for tasks $\tau_3$ and $\tau_4$, and the LP core (at the bottom) is slowed down maximally ($f = 0.7$) so that all backup copies ($B_3$ and $B_4$) can make their deadline. The overall energy consumption is 24.7 mJ. Figure 6b shows the execution of the same task set under partitioning and the DBC policy. The scheme reclaims the reserved capacity for the backup copies $B_3$ and $B_4$ whose primaries complete without fault, and uses this capacity to further slow down a primary task on the LP core to speed $f = 0.54$. However, this introduces overlapped execution with the corresponding backup, and in this case, hurts the energy savings. The overall energy consumption of this system is 36.7 mJ. Finally, Figure 6c shows the execution under partitioning and the DMO runtime policy. Although this scheme could use all the reclaimed capacity from $B_3$ and $B_4$, it runs at the maximum speed of the LP core ($f = 0.8$) to minimize the overlap with the backup. This execution yields an overall energy consumption of 20.2 mJ, which is 18% lower than that of the static policy.

IV. EXPERIMENTAL EVALUATION

We evaluated the energy consumption performance of the proposed algorithms in a discrete event simulator. We simulated dual-core systems with $f_{max}^{HP} = 1.0$ and $f_{max}^{LP}$ varied from 0.6 to 1.0. Due to space limitations, we show the results for $f_{max}^{LP} = 0.8$, and analyze the impact of varying $f_{max}^{LP}$ separately in Section IV-C. It is known that the power parameters and required number of cycles for different tasks scale differently on heterogeneous systems [11].
Therefore, as in [10], we define $tscale_i = \frac{C_i^{LP}}{C_i^{HP}}$, which models how execution time changes on the LP core for a given task $\tau_i$. Moreover, following [10], we define $pscale_i$ to be the ratio of the power consumption of $\tau_i$ on the LP core to that on the HP core. Therefore, $pscale_i = \frac{P_i^{LP}}{P_i^{HP}}$, which is also assumed to be the same as $\frac{a_i^{LP}}{a_i^{HP}} = \frac{\alpha_i^{LP}}{\alpha_i^{HP}}$.

For each experiment, the simulator generates a task set containing $n$ tasks and a given total utilization, $U$. The utilization value is calculated with respect to the LP core (which is more constrained in terms of performance) and normalized considering its maximum speed. Hence, $U = \left(\sum_i \frac{C_i^{LP}}{D}\right) / f_{max}^{LP}$. Based on the target $U$, we use the RandFixedSum algorithm [13] to assign a random utilization (according to the uniform distribution) to each task such that the total utilization equals $U$. We set the frame deadline $D$ = ms. Next, for each task a $tscale_i$ and a $pscale_i$ value are chosen randomly within the ranges suggested in [11]. Specifically, $1.4 \le tscale_i \le 2.3$ and $1.4 \le 1/(tscale_i \cdot pscale_i) \le 2.1$ hold. We assume for all tasks, $a^{HP} = 1.0$ and $\alpha^{HP} = 0.1$. In addition, $P_{idle}^{HP} = 0.05$ and $P_{idle}^{LP} = 0.02$ for all experiments.

Each generated task set is partitioned upon the HP and LP cores according to one of the proposed partitioning algorithms. For every partition obtained in this way, we simulate the execution according to the speed assignment policies that we suggested, and record the energy consumption. Every combination of a partitioning scheme and a speed assignment algorithm gives us a valid overall algorithm, whose name is indicated by the concatenation of the member schemes (e.g., -SSA, -DMO). We use task sets with $n = 10$ in all the results shown, but we discuss the impact of varying the number of tasks in Section IV-C. Every reported data point is the average of 3000 runs. We report the average energy consumption in fault-free executions, since faults are very rare events.
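The task-set generation step described above could be approximated as in the sketch below. Several points are assumptions rather than the paper's procedure: normalized exponential draws (uniform over the simplex) stand in for RandFixedSum [13], the quantity $1/(tscale_i \cdot pscale_i)$ is drawn uniformly from its allowed range, and the frame length is left as a caller-supplied parameter. The function name and dictionary keys are hypothetical.

```python
import random
from typing import Dict, List, Optional


def random_task_set(n: int, total_util: float, frame: float,
                    f_max_lp: float = 0.8,
                    seed: Optional[int] = None) -> List[Dict[str, float]]:
    """Generate n tasks whose LP-core utilizations sum to total_util."""
    rng = random.Random(seed)
    # Stand-in for RandFixedSum: normalized exponential draws are uniformly
    # distributed over the set of non-negative vectors with a fixed sum.
    raw = [rng.expovariate(1.0) for _ in range(n)]
    scale = total_util / sum(raw)
    tasks = []
    for x in raw:
        util = x * scale
        c_lp = util * f_max_lp * frame   # from U = (sum C_lp / D) / f_max_lp
        tscale = rng.uniform(1.4, 2.3)   # tscale = C_lp / C_hp
        ratio = rng.uniform(1.4, 2.1)    # assumed draw of 1 / (tscale * pscale)
        tasks.append({
            "util": util,
            "c_lp": c_lp,
            "c_hp": c_lp / tscale,
            "tscale": tscale,
            "pscale": 1.0 / (tscale * ratio),
        })
    return tasks


# Arbitrary illustration values: 10 tasks, U = 0.6, frame length 100.
task_set = random_task_set(10, 0.6, 100.0, seed=1)
```

Given a `pscale` value, the LP-core power parameters of a task follow from the HP-core ones by multiplication, since $pscale_i$ is assumed to equal both $a_i^{LP}/a_i^{HP}$ and $\alpha_i^{LP}/\alpha_i^{HP}$.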
The obtained energy consumption numbers are normalized with respect to the maximum energy consumption (observed in the considered parameter spectrum) of a standby-sparing system with static speed assignment in which all the primary copies are allocated to the LP core [10]. Due to the multiple dimensions of the problem and the large number of scheme combinations, we adopt a hierarchical approach in our evaluation. We first discuss the performance of the partitioning algorithms by fixing the speed assignment policy. Next, we compare the performance of the proposed speed assignment policies, and also investigate the impact of the chosen threshold value on the fixed-threshold algorithm. Finally, we show the effect of the maximum speed of the LP core and the effect of the number of tasks.

A. Evaluation of Partitioning Algorithms

We implemented the following partitioning schemes in our simulator:

- List-scheduling with Primaries ()
- List-scheduling with Backups ()
- Fixed-threshold Algorithm ()
- Standby-sparing ()
- Optimal Partitioning ()

The optimal partitioning we show in the plots is obtained by exhaustively enumerating all possible task allocations, measuring their runtime energy consumption, and choosing the best. This exhaustive search becomes impractical when the number of tasks grows beyond 15. The standby-sparing algorithm is adopted from the SlowerP scheme in [10], because it is shown to be the best-performing one in its respective context. The threshold value for the fixed-threshold algorithm is fixed as 0.6. The energy consumption of the partitioning algorithms is shown using the static speed assignment algorithm (SSA); we obtained similar trends with the other (DBC and DMO) algorithms.

Fig. 7: Performance of partitioning algorithms. (a) Impact of utilization. (b) Impact of tscale. (c) Impact of pscale.

Impact of Utilization. In Figure 7a, we show the impact of utilization on normalized energy consumption. When the utilization is low, the algorithm's performance approaches the optimal one, suggesting that allocating all primary tasks to the LP core, and all the (delayed) backups to the HP core, is the best strategy. This is because under low load, the LP core can finish the primary workload quickly and in a power-efficient way, allowing the backup tasks to get cancelled on the HP core early. This is evident for the scheme too, because it allocates all the primary workload to one core as well. As the load increases, drifts from the optimal scheme and becomes a comparable scheme. This is due to the fact that, as the load grows, a more balanced partitioning is preferable, which can allow a suitable distribution of the reserved space for backup copies such that their activation is seldom needed. Both and give relatively balanced partitionings, but generally allocates more primary copies to the LP core, with an energy advantage. scheme, performing very poorly in the low-load case, starts to outperform both and when the utilization exceeds %, and comes within 5% of the optimal scheme.
For heavy load, executing primary copies on the HP core is preferable because in this case, the backup copies cannot, in general, get cancelled, and executing them at the maximum speed of the LP core is preferable to executing them at the maximum speed of the HP core. For the same reasons, performs the worst for heavy load cases.

Impact of tscale. Figure 7b shows the impact of tscale on the performance of the partitioning algorithms. tscale is varied within the range of 1.4 to 2.3, which is obtained from [11]. In general, larger tscale values indicate that tasks take much longer to complete on the LP core, despite its power-efficiency. In these experiments the utilization is fixed at %, and therefore, increasing tscale implies additional unused capacity on the HP core. We see that performs consistently within 3% of the optimal scheme throughout the entire range of tscale. This is because executing the primary copies of the workload on the power-efficient LP core results in less energy consumption, and tends to allocate primary workload to the LP core. , on the other hand, has a tendency to assign primary workloads to the HP core, and in general, it lags behind. comes very close to the performance of as tscale increases.

Impact of pscale. Figure 7c shows the impact of pscale on the performance of the partitioning algorithms. When the LP core is very power-efficient, i.e., pscale is low, and come very close to the optimal scheme. This is because at the fixed % system load, assigns most of the primary workload to the LP core, and that helps save energy. As pscale grows, drifts away from the optimal scheme the most, because it is no longer efficient to use the LP core for most of the primary workload. However, can still perform within 5% of the optimal scheme, because it produces a more balanced partitioning with a bias to allocate the primary tasks to the LP core. , which produces a balanced partitioning with a bias to assign the primary tasks to the HP core, performs poorly for low pscale, but starts to outperform for pscale greater than 0.4 and comes within 2% of the optimal scheme.

Fig. 8: Performance of the speed assignment algorithms. (a) Impact of utilization. (b) Impact of tscale. (c) Impact of pscale.

B. Evaluation of Speed Assignment Algorithms

We implemented the following speed assignment policies:
- Static Speed Assignment (SSA)
- Dynamic Backup Cancellation (DBC)
- Dynamic Backup Cancellation with Minimum Overlap (DMO)
- Bound

The Bound algorithm is implemented as a yardstick speed assignment algorithm. After partitioning the tasks, the execution slots are still reserved for backup tasks; those slots are dynamically released (as in DBC), but no extra energy consumption is recorded for the overlapped execution of the backup tasks at run-time. Since the backup executions essentially incur zero energy cost, no speed assignment algorithm can outperform Bound. We matched Bound with the exhaustive-search-based Optimal partitioning algorithm, obtaining a combined scheme denoted by Opt-Bound, which gives the lower bound on the performance of any realistic MPB algorithm. Given the large number of partitioning/speed assignment scheme combinations, for other schemes, we show only the results we obtained with the best performing partitioning algorithms, namely and. We use the Overlap-Aware speed assignment scheme for standby-sparing, as it is shown to be the best performing scheme for standby-sparing in [10].

Impact of Utilization. In Figure 8a, we see that both -DMO and -DMO perform within 2% of Opt-Bound. This is because dynamically reclaiming the capacity for backup tasks and minimizing overlap while applying DVFS, as done within DMO, is a very effective strategy. This is also true for at low load, because it allows some carefully calculated overlapped execution. As the load increases, drifts away from Opt-Bound the most, because it has the restriction that it cannot allocate primary and backup copies on the same processor. -DBC and -DBC perform poorly for moderately loaded systems due to the large overlapped executions that they create.
However, for heavy load, backup copies need to run until the deadline anyway, therefore the performance of the DBC scheme improves. Both -SSA and -SSA offer decent performance levels unless the load is very high.

Impact of tscale. As we change the tscale value (when the load is fixed at %), -DMO performs the best and stays within 3% of Opt-Bound (Figure 8b). The next best performing scheme is -DMO. This again suggests the superiority of DMO thanks to its dynamic but moderately aggressive approach in applying DVFS while avoiding overlaps. The plot also shows that -DBC performs the worst, and -DBC performs the worst among all the algorithms. This is because DBC aggressively slows down a task without regard to the overlapped execution.

Impact of pscale. Varying pscale yields similar trends (Figure 8c). -DMO and -DMO perform the best, within 3% of Opt-Bound, by exploiting the overlap avoidance strategy of DMO. The performance of -DMO, however, decreases as the LP core becomes less power-efficient (pscale increases). This is because, with a less power-efficient LP core, it is no longer favorable to assign primary workload to the LP core up to a threshold. Due to the aggressive frequency scaling of the DBC scheme, -DBC performs the worst throughout the entire spectrum. -DBC, performing poorly for low pscale, starts to improve when pscale is greater than 0.4, and comes within 1% of Bound. This is because when the LP core is less power-efficient, slowing it down as much as possible proves helpful from the energy consumption perspective.

C. Additional Results

Fig. 9: Impact of the threshold value in the fixed-threshold algorithm. (a) Utilization = %. (b) Utilization = %.

Fig. 10: Additional Evaluations. (a) Impact of the maximum speed of the LP core. (b) Impact of the number of tasks.

Impact of the threshold value in the fixed-threshold algorithm. The Fixed-Threshold () algorithm works by allocating all primary tasks to the LP core until a threshold utilization is reached. Figure 9a shows the impact of the threshold value on a system that is % loaded and with the DMO policy. The results indicate that the energy consumption of the fixed-threshold algorithm decreases as we increase the threshold value, and at about 0.45, it outperforms the otherwise best performing algorithm. Its energy consumption is minimized at some threshold value around 0.. The energy consumption goes up as we increase the threshold further and becomes constant at some point, because when the threshold value exceeds the utilization, all of the workload is assigned to the LP core. The threshold-independent algorithms naturally yield a constant energy consumption. Figure 9b shows a similar pattern for a system with % load. The results suggest that choosing a threshold value in the range [0.5, 0.6] is generally a very good choice when using the fixed-threshold algorithm.

Impact of the maximum speed of the LP core. In this set of experiments, we varied the maximum speed of the LP core while fixing the load at % for each configuration (Figure 10a). The performance of -DMO remains within 5% of Opt-Bound for the entire region, suggesting that it is applicable in a wide range of heterogeneous systems. algorithms, on the other hand, tend to drift away from Opt-Bound as the maximum speeds of the two processing cores become close to each other. We also see that the energy consumption of all schemes increases with increasing $f_{max}^{LP}$. This is because, when the utilization is kept fixed at %, increasing $f_{max}^{LP}$ increases the effective amount of workload on the system, which is reflected in the results.

Impact of the number of tasks. Figure 10b shows the impact of the number of tasks for a system with utilization %. We see that for a small number of tasks, the performance of all the schemes is affected. As the number of tasks grows, the average task size decreases and the performance of the various schemes stabilizes. We can see that performs within 3% of the optimal scheme, and is about 3% worse than , for the entire region. performs worse than the other two, but it also shows stable performance when the number of tasks grows.
For the optimal scheme, we could only calculate energy consumption for up to 17 tasks due to its prohibitive computational complexity.

V. CONCLUSION

In this paper, we proposed a fault-tolerant framework for heterogeneous dual-core systems, together with techniques that keep energy consumption at a minimum level. We devised task partitioning algorithms along with runtime frequency assignment policies, taking into account the different execution-time and power-parameter scaling factors for application tasks on heterogeneous cores. Our simulation experiments show that our proposed schemes perform very close to the theoretical lower bound.

ACKNOWLEDGMENTS

This work was supported by the US National Science Foundation Awards CNS and CNS.

REFERENCES

[1] ARM big.LITTLE Technology. processors/technologies/biglittleprocessing.php.
[2] T. S. Muthukaruppan, M. Pricopi, V. Venkataraman, T. Mitra, and S. Vishin, "Hierarchical power management for asymmetric multi-core in dark silicon era," in Proc. of ACM/IEEE DAC.
[3] G. Singla, G. Kaur, A. K. Unver, and U. Y. Ogras, "Predictive dynamic thermal and power management for heterogeneous mobile platforms," in Proc. of IEEE DATE.
[4] I. Koren and C. M. Krishna, Fault-tolerant systems. Morgan Kaufmann.
[5] H. Aydin, R. Melhem, and D. Mossé, "Tolerating faults while maximizing reward," in Proc. of IEEE ECRTS.
[6] B. Zhao, H. Aydin, and D. Zhu, "Shared recovery for energy efficiency and reliability enhancements in real-time applications with precedence constraints," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 18, no. 2, p. 23.
[7] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim, and K. Flautner, "Razor: circuit-level correction of timing errors for low-power operation," IEEE Micro, vol. 24, no. 6.
[8] A. Ejlali, B. M. Al-Hashimi, and P. Eles, "Low-energy standby-sparing for hard real-time systems," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 3.
[9] M. A. Haque, H. Aydin, and D. Zhu, "Energy-aware standby-sparing technique for periodic real-time applications," in Proc. of IEEE ICCD.
[10] A. Roy, H. Aydin, and D. Zhu, "Energy-aware standby-sparing on heterogeneous multicore systems," in Proc. of IEEE/ACM DAC.
[11] M. Pricopi, T. S. Muthukaruppan, V. Venkataraman, T. Mitra, and S. Vishin, "Power-performance modeling on asymmetric multi-cores," in Proc. of IEEE CASES.
[12] D. Zhu, R. Melhem, and D. Mossé, "The effects of energy management on reliability in real-time embedded systems," in Proc. of IEEE ICCAD.
[13] P. Emberson, R. Stafford, and R. I. Davis, "Techniques for the synthesis of multiprocessor tasksets," in Proc. of the Int. WS on Analysis Tools and Methodologies for Embedded and Real-time Systems (WATERS), 2010.
