Predictable Execution Model: Concept and Implementation


Rodolfo Pellizzoni, Emiliano Betti, Stanley Bak, Gang Yao, John Criswell and Marco Caccamo
University of Illinois at Urbana-Champaign, IL, USA; Scuola Superiore Sant'Anna, Italy

Abstract

Building safety-critical real-time systems out of inexpensive, non-real-time, COTS components is challenging. Although COTS components generally offer high performance, they can occasionally incur significant timing delays. To prevent this, we propose controlling the operating point of each COTS shared resource (like the cache, memory, and interconnection buses) to maintain it below its saturation limit. This is necessary because the low-level arbiters of these shared resources are not typically designed to provide real-time guarantees. In this work, we introduce a novel system execution model, the PRedictable Execution Model (PREM), which, in contrast to the standard COTS execution model, coschedules at a high level all active COTS components in the system, such as CPU cores and I/O peripherals. In order to permit predictable, system-wide execution, we argue that real-time embedded applications need to be compiled according to a new set of rules dictated by PREM. To experimentally validate our theory, we developed a COTS-based PREM testbed and modified the LLVM Compiler Infrastructure to produce PREM-compatible executables.

1. Introduction

Real-time embedded systems are increasingly being built using commercial-off-the-shelf (COTS) components such as mass-produced CPUs, peripherals and buses. The overall performance of mass-produced components is often significantly higher than that of custom-made systems. For example, a PCI Express bus [14] can transfer data three orders of magnitude faster than the real-time SAFEbus [8]. However, the main drawback of using COTS components within a real-time system is the presence of unpredictable timing anomalies, since the individual components are typically designed paying little or no attention to worst-case timing behavior. Additionally, modern COTS-based embedded systems include multiple active components (such as CPU cores and I/O peripherals) that can independently initiate access to shared resources, which, in the worst case, causes contention leading to timing degradation. Computing precise bounds on timing delays due to contention is difficult. Even though some existing approaches can produce safe upper bounds, they need to be very pessimistic due to the unpredictable behavior of arbiters of physically shared COTS resources (like caches, memories, and buses). As a motivating example, we have previously shown that the computation time of a task can increase linearly with the number of cache misses suffered due to contention for access to main memory [16]. In a system with three active components, a task's worst-case computation time can nearly triple.

To exploit the high average performance of COTS components without experiencing the long delays occasionally suffered by real-time tasks, we need to control the operating point of each COTS shared resource and maintain it below saturation limits. This is necessary because the low-level arbiters of the shared resources are not typically designed to provide real-time guarantees. This work aims at showing that this is indeed possible by carefully rethinking the execution model of real-time tasks and by enforcing a high-level coscheduling mechanism among all active COTS components in the system. Briefly, the key idea is to coschedule active components so that contention for accessing COTS shared resources is implicitly resolved by the high-level coscheduler without relying on low-level, non-real-time arbiters.
Several challenges had to be overcome to realize the PRedictable Execution Model (PREM):

- Task execution times suffer high variance due to internal CPU architecture features (caches, pipelines, etc.) and unknown cache miss patterns. This source of temporal unpredictability forces the designer to make very pessimistic assumptions when performing schedulability analysis. To address this problem, PREM uses a novel program execution model with three main features: (1) jobs are divided into a sequence of non-preemptive scheduling intervals; (2) some of these scheduling intervals (named predictable intervals) are executed predictably and without cache misses by

prefetching all required data at the beginning of the interval itself; (3) the execution time of predictable intervals is kept constant by monitoring CPU time counters at run-time.

- I/O peripherals with DMA master capabilities contend for physically shared resources, including memory and buses, in an unpredictable manner. To address this problem, we expand upon our previous work [1] and introduce hardware to put the COTS I/O subsystem under the discipline of real-time scheduling.

- Low-level COTS arbiters are usually designed to achieve fairness instead of real-time performance. To address this problem, we enforce a coscheduling mechanism that serializes arbitration requests of active components (CPU cores and I/O peripherals). During the execution of a task's predictable interval, a scheduled peripheral can access the bus and memory without experiencing delays due to cache misses caused by the task's execution.

Our PRedictable Execution Model (PREM) can be used with a high-level programming language like C by setting some programming guidelines and by using a modified compiler to generate predictable executables. The programmer provides some information, like the beginning and end of each predictable execution interval, and the compiler generates programs which perform cache prefetching and enforce a constant execution time in each predictable interval. In light of the above discussion, we argue that real-time embedded applications should be compiled according to a new set of rules dictated by PREM. At the price of minor additional work by the programmer, the generated executable becomes far more predictable than state-of-the-art compiled code, and when run with the rest of the PREM system, shows significantly reduced worst-case execution time.

The rest of the paper is organized as follows. Section 2 discusses related work. In Section 3 we describe our main contribution: a co-scheduling mechanism that schedules I/O interrupt handlers, task memory accesses and I/O peripheral data transfers in such a way that access to shared COTS resources is serialized, achieving zero or negligible contention during memory accesses. Then, in Sections 4 and 5 we discuss the challenges in terms of hardware architecture and code organization that must be met to predictably compile real-time tasks. Section 6 presents our schedulability analysis. Finally, in Section 7 we detail our prototype testbed, including our compiler implementation based on the LLVM Compiler Infrastructure [9], and provide an experimental evaluation. We conclude with future work in Section 8.

2. Related Work

Several solutions have been proposed in prior real-time research to address different sources of unpredictability in COTS components, including real-time handling of peripheral drivers, real-time compilation, and analysis of contention for memory and buses. For peripheral drivers, Facchinetti et al. [4] proposed using a non-preemptive interrupt server to better support the reuse of legacy drivers. Additionally, analysis can be done to model the worst-case temporal interference caused by device drivers [10]. For real-time compilation, a tight coupling between compiler and worst-case execution time (WCET) analyzer can optimize a program's WCET [5]. Alternatively, a compiler-based approach can provide predictable paging [17]. For analysis of contention for memory and buses, existing techniques can analyze the maximum delay caused by contention for a shared memory or bus under various access models [15, 20]. All these works attempt to analyze or control a single resource, and obtain safe bounds that are often highly pessimistic. Instead, PREM is based on a global coschedule of all relevant system resources.
Instead of using COTS components, other researchers have discussed new architectural solutions that can greatly increase system predictability by removing significant sources of interference. Instead of a standard cache-based architecture, a real-time scratchpad architecture can be used to provide predictable access time to main memory [22]. The Precision Timed (PRET) machine [3] promises to simultaneously deliver high computational performance together with cycle-accurate estimation of program execution time. While our PREM execution model borrows some ideas from these works, it exhibits one key difference: our model can be applied to existing COTS-based systems without requiring significant architectural redesign. This approach allows PREM to leverage the economy of scale of COTS systems, and supports the progressive migration of legacy systems.

3. System Model

We consider a typical COTS-based real-time embedded system comprising a CPU, main memory and multiple DMA peripherals. While in this paper we restrict our discussion to single-core systems with no hardware multithreading, we believe that our predictable execution model is also applicable to multicore systems. We will present a predictable execution model for multicore systems as part of our planned future work. The CPU can implement one or more cache levels. We focus on the last cache level, which typically employs a write-back policy. Whenever a task

suffers a cache miss in the last level, the cache controller must access main memory to fetch the newly referenced cache line and possibly write back a replaced cache line. Peripherals are connected to the system through a COTS interconnect such as PCI or PCIe [14]. DMA peripherals can autonomously initiate data transfers on the interconnect. We assume that all data transfers target main memory, that is, data is always transferred between the peripheral's internal buffers and main memory. Therefore, we can treat main memory as a single resource shared by all peripherals and by the cache controller.

The CPU executes a set of N real-time periodic tasks Γ = {τ_1, ..., τ_N}. Each task can use one or more peripherals to transfer input or output data to or from main memory. We model all peripheral activities as a set of M periodic I/O flows Γ^{I/O} = {τ^{I/O}_1, ..., τ^{I/O}_M} with assigned timing reservations, and we want to schedule them in such a way that only one flow is transferred at a time. Unfortunately, COTS peripherals do not typically conform to the described model. As an example, consider a task receiving input data from a Network Interface Card (NIC). Delays in the network could easily cause a burst of packets to arrive at the NIC. Since a high-performance COTS NIC is designed to autonomously transfer incoming packets to main memory as soon as possible, the NIC could potentially require memory access for significantly longer than its expected periodic reservation. In [1], we first introduced a solution to this problem consisting of a real-time I/O management scheme. A diagram of our proposed architecture is depicted in Figure 1.

Figure 1: Real-Time I/O Management System.

A minimally intrusive hardware device, called real-time bridge, is interposed between each peripheral and the rest of the system. The real-time bridge buffers all incoming traffic from the peripheral and delivers it predictably to main memory according to a global I/O schedule. Outgoing traffic is also retrieved from main memory in a predictable fashion. To maximize responsiveness and avoid CPU overhead, the I/O schedule is computed by a separate peripheral scheduler, a hardware device based on the previously-developed reservation controller [1], which controls all real-time bridges. For simplicity, in the rest of our discussion we assume that each peripheral services a single task. However, this assumption can be lifted by supporting peripheral virtualization in real-time bridges. More details about our hardware/software implementation are provided in Section 7.

Notice that our previously-developed I/O management system [1] does not solve the problem of memory interference between peripherals and CPU tasks. When a typical real-time task is executed on a COTS CPU, cache misses are unpredictable, making it difficult to avoid low-level contention for access to main memory. To overcome this issue, we propose a set of compiler and OS techniques that enable us to predictably schedule all cache misses during a given portion of a task's execution. The code for each task τ_i is divided into a set of N_i scheduling intervals {s_{i,1}, ..., s_{i,N_i}}, which are executed sequentially at runtime.
The timing requirements of τ_i can be expressed by a tuple {{e_{i,1}, ..., e_{i,N_i}}, p_i, D_i}, where p_i, D_i are the period and relative deadline of the task, with D_i ≤ p_i, and e_{i,j} is the maximum execution time of s_{i,j}, assuming that the interval runs in isolation with no memory interference. A job can only be preempted by a higher priority job at the end of a scheduling interval. This ensures that the cache content cannot be altered by the preempting job during the execution of an interval.

We classify the scheduling intervals into compatible intervals and predictable intervals. Compatible intervals are compiled and executed without any special provisions (they are backwards compatible). Cache misses can happen at any time during these intervals. The task code is allowed to perform OS system calls, but blocking calls must have bounded blocking time. Furthermore, the task can be preempted by interrupt handlers of associated peripherals. We assume that the maximum execution time e_{i,j} of a compatible interval can be computed based on traditional static analysis techniques. However, to reduce the pessimism in the analysis, we prohibit peripheral traffic from being transmitted during a compatible interval. Ideally, there should be a small number of compatible intervals, which are kept as short as possible.

Predictable intervals are specially compiled to execute according to the PREM model shown in Figure 2, and exhibit three main properties.

Figure 2: Predictable interval with constant execution time.

First, each predictable interval is divided into two different phases. During the initial memory phase, the CPU accesses main memory to perform a set of

cache line fetches and replacements. At the end of the memory phase, all cache lines required during the predictable interval are available in last level cache. Second, the second phase is known as the execution phase. During this phase, the task performs useful computation without suffering any last level cache misses. Predictable intervals do not contain any system calls and cannot be preempted by interrupt handlers. Hence, the CPU does not perform any external main memory access during the execution phase. Due to this property, peripheral traffic can be scheduled during the execution phase of a predictable interval without causing any contention for access to main memory. Third, at run-time, we force the execution time of a predictable interval to be always equal to e_{i,j}. Let e^mem_{i,j} be the maximum time required to complete the memory phase and e^exec_{i,j} the maximum time required to complete the execution phase. Then offline we set e_{i,j} = e^mem_{i,j} + e^exec_{i,j}, and at run-time, even if the memory phase lasts for less than e^mem_{i,j} time units, the overall interval still completes in exactly e_{i,j}. This property greatly increases task predictability without affecting CPU worst-case guarantees. In particular, as we show in Section 6, it ensures that hard real-time guarantees can be extended to I/O flows.

Figure 3: Example System-Level Predictable Schedule.

Figure 3 shows a concrete example of a system-level predictable schedule for a task set comprising two tasks τ_1, τ_2 together with two I/O flows τ^{I/O}_1, τ^{I/O}_2 which service τ_1 and τ_2 respectively. Both tasks and I/O flows are scheduled according to fixed priority, with τ_1 having higher priority than τ_2 and τ^{I/O}_1 higher priority than τ^{I/O}_2. We set D_i = p_i and assign to each I/O flow the same period and deadline as its serviced task and a transmission time equal to 4 time units. As shown in Figure 3 for task τ_1, this means that the input data for a given job is transmitted in the period before the job is executed, and the output data is transmitted in the period after. Task τ_1 has a single predictable interval of length e_{1,2} = 4 while τ_2 has two predictable intervals of lengths e_{2,2} = 4 and e_{2,3} = 3. The first and last interval of both τ_1 and τ_2 are special compatible intervals. These intervals are needed to execute the associated peripheral driver (including interrupt handlers) and set up the reception and transmission buffers in main memory (i.e. read and write system calls). More details are provided in Section 7. I/O flows can be scheduled both during execution phases and while the CPU is idle. As we will show in Section 6, the described scheme can be modeled as a hierarchical scheduling system [21], where the CPU schedule of predictable intervals supplies available transmission time to I/O flows. Therefore, existing tests can be reused to check the schedulability of I/O flows. However, due to the characteristics of predictable intervals, a more complex analysis is required to derive the supply function.

4. Architectural Constraints and Solutions

Predictable intervals are executed in a radically different way compared to the speculative execution model that COTS components are typically designed to support. In this section, we detail the challenges and solutions to implement the PREM execution model on top of a COTS architecture.

4.1. Caching and Prefetch

Our general strategy to implement the memory phase consists of two steps: (1) we determine the complete set of memory regions that are accessed during the interval. Each region is a continuous area in virtual memory. In general, its start address can only be determined at run-time, but its size is known at compile time. (2) During the memory phase, we prefetch all cache lines that contain instructions and data for the required regions; most COTS instruction sets include a prefetch instruction that can be used to load specific cache lines in last level cache. Step (1) will be detailed in Section 5. Step (2) can be successful only if there is no cache self-eviction, that is, prefetching a cache line never evicts another line that has already been accessed (prefetched) during

the same memory phase. In the remainder of this section, we describe self-eviction prevention.

Figure 4: Cache organization with one memory region.

Most COTS CPUs implement the last-level cache as an N-way set associative cache. Let B be the total size of the cache and L be the size of each cache line in bytes. Then the byte size of each of the N cache ways is W = B/N. An associative set is the set of all cache lines, one for each way, which have the same index in cache; there are W/L associative sets. Last level cache is typically physically tagged and physically indexed, meaning that cache lines are accessed based on physical memory addresses only. We also assume that last level cache is not exclusive, that is, when a cache line is copied to a higher cache level it is not removed from the last level. Figure 4 shows an example where L = 4, W = 16 (parameters are chosen to simplify the discussion and are not representative of typical systems).

The main idea behind our conflict analysis is as follows: we compute the maximum number of entries in each associative set that are required to hold the cache lines prefetched for all memory regions in a scheduling interval. Based on the cache replacement policy, we then derive a safe lower bound on the number of entries that can be prefetched in an associative set without causing any self-eviction. Consider a memory region of size A. The region will occupy at most K = ⌈(A − 1)/L⌉ + 1 cache lines. As shown in Figure 4 for a region with A = 15, K = 5: the worst case is produced when the region uses a single byte in its first cache line. Assume now that virtual memory addresses coincide with physical addresses. Then, since the region is contiguous and there are W/L cache lines in each way, the maximum number of entries used in any associative set by the region is ⌈K/(W/L)⌉. For example, the region in Figure 4 requires two entries in the set with index 0. We then derive the maximum number of entries for the entire interval by summing the entries required for each memory region.

Unfortunately, this is not generally true if the system employs paged virtual memory. If the page size P is smaller than the size W of each way, the index of each cache line inside the cache way is different for virtual and physical addresses. In the example of Figure 4 with P = 8, the number of entries for the memory region is increased from 2 to 3. We consider two solutions: 1) if the system supports it, we can select a page size that is a multiple of W just for our specific process. This solution, which we employed in our implementation, solves the problem because the index in cache for virtual and physical addresses is the same no matter the page allocation. 2) We use a modified page allocation algorithm in the OS. Cache-aware allocation algorithms have been proposed in the literature, for example in [11] for cache partitioning. Note that a suitable allocation algorithm could decrease the required number of associative entries by controlling the allocation in physical memory of multiple regions. We plan to pursue this solution in our future work.
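As a concrete illustration of the footprint computation above, the following small C sketch (purely illustrative helper code, not part of the PREM toolchain; all names are hypothetical) evaluates the worst-case number of cache lines K occupied by one contiguous region of size A, and the resulting number of entries it can require in a single associative set, under the assumption discussed above that virtual and physical cache indices coincide.

    #include <stdio.h>

    /* Worst-case cache footprint of one contiguous memory region.
     * A: region size in bytes, L: cache line size, W: bytes per way. */
    static unsigned lines_worst_case(unsigned A, unsigned L)
    {
        /* K = ceil((A - 1) / L) + 1: worst case occurs when the region
         * uses a single byte of its first cache line. */
        return (A - 1 + L - 1) / L + 1;
    }

    static unsigned entries_per_set(unsigned A, unsigned L, unsigned W)
    {
        unsigned K = lines_worst_case(A, L);
        unsigned sets = W / L;            /* associative sets per way */
        return (K + sets - 1) / sets;     /* ceil(K / (W/L)) */
    }

    int main(void)
    {
        /* The toy parameters of Figure 4: L = 4, W = 16, A = 15. */
        printf("K = %u, entries in one set = %u\n",
               lines_worst_case(15, 4), entries_per_set(15, 4, 16));
        return 0;   /* prints: K = 5, entries in one set = 2 */
    }

Summing the per-region results over all regions accessed by the interval gives, for each associative set, the total number of entries; the maximum over all sets is the quantity Q used in the replacement-policy analysis below.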
We now consider the cache replacement policy. We consider four different replacement policies: random, FIFO, LRU and pseudo-LRU, which cover most implemented cache architectures. Let Q be the maximum number of entries in any associative set required by the predictable interval. Furthermore, let Q̄ be the number of such entries relative to cache lines that are accessed multiple times during the memory phase; in our implementation, this only includes the cache lines that contain the small amount of instructions of the memory phase itself, so Q̄ = 1. A detailed analysis of the four replacement policies, based on the work in [7, 19], is provided in Appendix A; in particular, based on the replacement policy and possibly Q̄, we show how to compute a lower bound on the number of entries that can be safely prefetched, such that no self-eviction can happen if Q is less than or equal to the derived bound. It is important to notice that the bound for some policies depends on the state of the cache before the beginning of the predictable interval. Therefore, we consider two different cache invalidation models. In the full-invalidation model, none of the cache lines used by a predictable interval are already available in cache at the beginning of the memory phase. Furthermore, the replacement policy is not affected by the presence of invalidated cache lines. As an example, this model can be enforced by fully invalidating the cache either at the end or at the beginning of each predictable interval, which in most architectures has the side effect of resetting all replacement queues/trees. While this model is more predictable, several COTS CPUs do not support it, including our testbed described in Section 7. In the partial-invalidation model, selected lines in last level cache can be invalidated, for example to handle data coherency between the CPU and peripherals in compatible intervals. Furthermore, the replacement policy always selects invalidated cache lines before valid lines, independently of the state of the queue/tree. Results in Appendix A are summarized by the following theorem.

Theorem 1. If cache lines are prefetched in a consistent or-

der, then at the end of the memory phase all Q cache lines in an associative set will be available in cache, requiring at most Q fetches to be loaded, if Q is at most equal to:
- 1 for random;
- N for FIFO and LRU;
- log₂ N + 1 for pseudo-LRU with partial-invalidation semantics;
- N + 2^Q̄ − Q̄ if Q̄ < log₂ N, and log₂ N + 1 otherwise, for pseudo-LRU with full-invalidation semantics.

4.2. Computing Phase Length

To provide predictable timing guarantees, the maximum time required for the memory phase, e^mem_{i,j}, and for the execution phase, e^exec_{i,j}, must be computed. We assume that an upper bound to e^exec_{i,j} can be derived based on static analysis of the execution phase. Note that our model only ensures that all required instructions and data are available in last level cache; cache misses can still occur in higher cache levels. Analyses that derive patterns of cache misses for split level 1 data and instruction caches have been proposed in [12, 18]. These analyses can be safely applied because, according to our model, level 1 misses during the execution phase will not require the CPU to perform any external access.

e^mem_{i,j} depends on the time required to prefetch all accessed memory regions. Note that since the task can be preempted at the boundary of scheduling intervals, the cache state is unknown at the beginning of the memory phase. Hence, each prefetch could require both a cache line fetch and a replacement in the last cache level. An analysis to compute upper bounds for read/write operations using a COTS DRAM memory controller is detailed in [13]. Note that since the number and relative addresses of prefetched cache lines are known, the analysis could potentially exploit features such as burst read and parallel bank access that are common in modern memory controllers. Finally, for systems implementing paged virtual memory, we employ the following three assumptions: 1) the CPU supports hardware Translation Lookaside Buffer (TLB) management; 2) all pages used by predictable intervals are locked in main memory; 3) the TLB is large enough to contain all page entries for a predictable interval without suffering any conflict. Under such assumptions, each page used in a predictable interval can cause at most one TLB miss during the memory phase, which requires a number of fetches in main memory equal to at most the number of levels of the page table.

4.3. Interval Length Enforcement

As described in Section 3, each predictable interval is required to execute for exactly e_{i,j} time units. If e_{i,j} is set to be at least e^mem_{i,j} + e^exec_{i,j}, the computation in the execution phase is guaranteed to terminate at or before the end of the interval. The interval can then enter an active wait until e_{i,j} time units have elapsed since the beginning of its memory phase. In our implementation, elapsed time is computed using a CPU performance counter that is directly accessible by the task; therefore, neither OS nor memory interaction is required to enforce the interval length.
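The active wait just described can be illustrated with the following minimal sketch, which busy-waits on the x86 time-stamp counter until a constant number of cycles has elapsed since the start of the memory phase. This is only a sketch under our stated assumptions: the helper names are hypothetical, the interval length would be precomputed offline and converted to cycles, and an actual implementation would use whichever CPU performance counter the platform exposes directly to the task.

    #include <stdint.h>

    /* Read the x86 time-stamp counter (one possible cycle counter). */
    static inline uint64_t read_cycles(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Spin until length_cycles cycles have elapsed since `start`, so that
     * the enclosing predictable interval always has the same length. */
    static void enforce_interval_length(uint64_t start, uint64_t length_cycles)
    {
        while (read_cycles() - start < length_cycles)
            ;   /* active wait: no OS or memory interaction required */
    }

A predictable interval records start = read_cycles() at the beginning of its memory phase and calls this helper just before returning.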
4.4. Scheduling Synchronization

In our model, peripherals are only allowed to transmit during a predictable interval's execution phase or while the CPU is idle. To compute the peripheral schedule, the peripheral scheduler must thus know the status of the CPU schedule. Synchronization can be achieved by connecting the peripheral scheduler to a peripheral interconnection as shown in Figure 1. Scheduling messages can then be sent by either a task or the OS to the peripheral scheduler. In particular, at the end of each memory phase the task sends to the peripheral scheduler the remaining amount of time until the end of the current predictable interval. Note that propagating a message through the interconnection takes non-zero time. Since this time is typically negligible compared to the length of a scheduling interval (in our implementation we measured an upper bound on the message propagation time of 1 us, while scheduling intervals are expected to be far longer), we will ignore it in the rest of our discussion. However, the schedulability analysis of Section 6 could be easily modified to take such overhead into account. Finally, to avoid executing interrupt handlers during predictable intervals, a peripheral should only raise interrupts to the CPU during compatible intervals of its serviced task. As we describe in Section 7, in our I/O management scheme peripherals raise interrupts through their assigned real-time bridge. Since the peripheral scheduler communicates with each real-time bridge, it is used to block interrupt propagation outside the desired compatible intervals. Note that blocking real-time bridge interrupts to the CPU will not cause any loss of input data, because the real-time bridge is capable of independently acknowledging the peripheral and storing all incoming data in the bridge's local buffer.

5. Programming Model

Our system supports COTS applications written in standard high-level languages such as C. Unmodified code can be executed within one or more compatible intervals.

To create predictable intervals, programmers add source code annotations as C preprocessor macros. The PREM real-time compiler creates code for predictable intervals so that it does not incur cache misses during the execution phase and the interval itself has a constant execution time. Due to current limitations of our static code analysis, we assume that the programmer manually partitions the task into intervals. In particular, compatible intervals can be handled in the conventional way, while each predictable interval is encapsulated into a single function. All code within the function, and any functions transitively called by this function, is executed as a single predictable interval. To correctly prefetch all code and data used by a predictable interval, we impose several constraints upon its code:

1. Only scalar and array-based memory accesses should occur within a predictable interval; there should be no use of pointer-based data structures.
2. The code can use data structures, in particular arrays, that are not local to functions in the predictable intervals, e.g. they are allocated either in global memory or in the heap (note that data structures in the heap must have been previously allocated during a compatible interval). However, the programmer should specify the first and last address that is accessed within the predictable interval for each non-local data structure. In general, it is difficult for the compiler to determine the first and last access to an array within a piece of code. The compiler needs this information to be able to load the relevant portion of each array into the last-level cache.
3. The functions within a predictable interval should not be called recursively. As described below, the compiler will inline callees into callers to make all the code within an interval contiguous in virtual memory. Furthermore, no system calls should be made by code within a predictable interval.
4. Code within a predictable interval may only have direct calls. This alleviates the need for pointer analysis to determine the targets of indirect function calls; such analysis is usually imprecise and would bloat the code within the interval.
5. No stack allocations should occur within loops. Since all variables must be loaded into the cache at function entry, it must be possible for the compiler to safely hoist the allocations to the beginning of the function which initiates the predictable interval.

While these constraints may seem restrictive, some of these features, e.g. indirect function calls, are rarely used in real-time C code, and the other constraints are met by many types of functions. We believe that the benefit of faster, more predictable behavior for program hot-spots outweighs the restrictions imposed by our programming model. Furthermore, existing code that is too complex to be compiled into predictable intervals can still be executed inside compatible intervals. Therefore, our model permits a smooth transition for legacy systems. Notice that the compiler can be used to verify that all of the aforementioned restrictions are met. Simple static analysis can determine whether there is any irregular data structure usage, indirect function calls, or system calls.
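To make the constraints concrete, the fragment below shows the general shape of a function encapsulating one predictable interval. The PREFETCH_DATA and STARTEXECUTION annotations anticipate the prototype macros described in Section 7; the array, constants and function names are purely illustrative.

    #define N_SAMPLES      1024
    #define FILTER_WCET_US 400        /* hypothetical execution-phase budget */

    /* Placeholder definitions so this sketch is self-contained; in the
     * prototype these macros come from the PREM support header and perform
     * the actual prefetching and the write to the yield register. */
    #ifndef PREFETCH_DATA
    #define PREFETCH_DATA(addr, size)  ((void)(addr), (void)(size))
    #define STARTEXECUTION(wcet)       ((void)(wcet))
    #endif

    static double samples[N_SAMPLES];  /* non-local (global) data structure */

    void filter_interval(double gain)  /* one predictable interval */
    {
        /* Memory phase: the programmer declares the first-to-last address
         * range of every non-local structure; code and stack prefetching
         * is inserted automatically by the compiler. */
        PREFETCH_DATA(&samples[0], sizeof(samples));
        STARTEXECUTION(FILTER_WCET_US);   /* start of the execution phase */

        /* Execution phase: scalar/array accesses only, no system calls,
         * no indirect or recursive calls, no stack allocation in loops. */
        for (int i = 0; i < N_SAMPLES; i++)
            samples[i] *= gain;

        /* The compiler-inserted epilogue at each return enforces the
         * interval's constant length e_{i,j}. */
    }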
During compilation, the compiler employs several transforms to ensure that code marked as being within a predictable interval does not cause a cache miss. First, the compiler inlines all functions called (either directly or transitively) during the interval into the top-level function defining the interval. This ensures that all program analysis is intra-procedural and that all the code for the interval is contiguous within virtual memory. Second, the compiler transforms the program so that all cache misses occur during the memory phase, which is located at the beginning of the predictable scheduling interval. To be specific, it inserts code after the function prologue to prefetch the code and data needed to execute the interval. Based on the described constraints, this includes three types of contiguous memory regions: (1) the code for the function; (2) the actual parameters passed to the function and the stack frame (which contains local variables and register spill slots); and (3) the data structures marked by the programmer as being accessed by the interval. Third, the compiler inserts code to send scheduling messages to the peripheral scheduler, as will be described in Section 7. Fourth, the compiler emits code at the end of the predictable interval to enforce its constant length. In particular, the compiler identifies all return instructions within the function and adds the required code before them. Finally, based on the information on prefetched memory regions, we assume that an external tool, such as the static timing analyzer used to compute the maximum phase length in Section 4.2, can check the absence of cache self-evictions according to the analysis provided in Section 4.1.

6. Schedulability Analysis

PREM allows us to enforce strict timing guarantees for both CPU tasks and their associated I/O flows. By setting timing parameters as shown in Figure 3, the task schedule becomes independent of I/O scheduling. Therefore, task schedulability can be checked using available schedulability tests. As an example, assume that tasks are scheduled according to fixed priority scheduling as in Figure 3. For a task τ_i, let e_i = Σ_{j=1}^{N_i} e_{i,j} be the sum of the execution times of its scheduling intervals, or equivalently, the execution time of the whole task. Furthermore, let hp_i ⊆ Γ be the set of tasks with higher priority than τ_i, and lp_i the set of tasks with lower

priority. Since scheduling intervals are executed non-preemptively, τ_i can suffer a blocking time due to lower priority tasks of at most B_i = max_{τ_l ∈ lp_i} max_{j=1...N_l} e_{l,j}. The worst-case response time of τ_i can then be found [2] as the fixed point r_i of the iteration:

r_i^{k+1} = e_i + B_i + Σ_{l ∈ hp_i} ⌈r_i^k / p_l⌉ · e_l,   (1)

starting from r_i^0 = e_i + B_i. Task set Γ is schedulable if ∀τ_i : r_i ≤ D_i.
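The recurrence of Equation (1) is easy to evaluate numerically. The sketch below is illustrative only; the arrays and indexing convention are hypothetical, with tasks sorted by decreasing priority.

    #include <math.h>

    /* e[i]: execution time, p[i]: period, D[i]: relative deadline,
     * B[i]: blocking from lower-priority intervals (all in one time unit).
     * Tasks are indexed by decreasing priority, index 0 = highest. */
    double response_time(int i, const double e[], const double p[],
                         const double D[], const double B[])
    {
        double r = e[i] + B[i];                  /* r_i^0 */
        for (;;) {
            double next = e[i] + B[i];
            for (int l = 0; l < i; l++)          /* interference from hp_i */
                next += ceil(r / p[l]) * e[l];
            if (next == r || next > D[i])        /* fixed point, or deadline miss */
                return next;
            r = next;
        }
    }

The task set is then schedulable if response_time(i, ...) ≤ D_i for every task.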
We now turn our attention to peripheral scheduling. Assume that each I/O flow τ^{I/O}_i is characterized by a maximum transmission time e^{I/O}_i (with no interference in either main memory or the interconnect), period p^{I/O}_i and relative deadline D^{I/O}_i, where D^{I/O}_i ≤ p^{I/O}_i. The schedulability analysis for I/O flows is more complex because the scheduling of data transfers depends on the task schedule. To solve this issue, we extend the hierarchical scheduling framework proposed by Shin and Lee in [21]. In this framework, tasks/flows in a child scheduling model execute using a timing resource provided by a parent scheduling model. Schedulability for the child model can be tested based on the supply bound function sbf(t), which represents the minimum resource supply provided to the child model in any interval of time t. In our model, the I/O flow schedule is the child model and sbf(t) represents the minimum amount of time in any interval of time t during which the execution phase of a predictable interval is scheduled or the CPU is idle. Define the service time bound function tbf(t) as the pseudo-inverse of sbf(t), that is, tbf(t) = min{x | sbf(x) ≥ t}. Then, if I/O flows are scheduled according to fixed priority, in [21] it is shown that the response time r^{I/O}_i of flow τ^{I/O}_i can be computed according to the iteration:

r_i^{I/O,k+1} = tbf( e^{I/O}_i + Σ_{l ∈ hp^{I/O}_i} ⌈r_i^{I/O,k} / p^{I/O}_l⌉ · e^{I/O}_l ),   (2)

where hp^{I/O}_i has the same meaning as hp_i.

In the remainder of this section, we detail how to compute sbf(t). For the sake of simplicity, let us initially assume that tasks are strictly periodic and that the initial activation time of each task is known. Furthermore, notice that using the solution described in Section 4, we could enforce interval lengths not just for predictable intervals, but also for all compatible intervals. Finally, let h be the hyperperiod of task set Γ, defined as the least common multiple of all task periods. Under these assumptions, it is easy to see that if Γ is feasible, the CPU schedule can be computed offline and repeats itself identically with period h after an initial interval of 2h time units (h time units if all tasks are activated simultaneously). Therefore, a tight sbf(t) can be computed as the minimum amount of supply (time during which the CPU is idle or in the execution phase of a predictable interval) during any interval of time t in the periodic task schedule, starting from the initial two hyperperiods. More formally, let {t_1, ..., t_K} be the set of start times for all scheduling intervals activated in the first two hyperperiods; the following theorem shows how to compute sbf(t).

Theorem 2. Let sf(t', t'') be the amount of supply provided in the periodic task schedule during the interval [t', t'']. Then:

sbf(t) = min_{k=1...K} sf(t_k, t_k + t).   (3)

Proof. Let t', t'' be two consecutive interval start times in the periodic schedule. We show that for all t̄ with t' ≤ t̄ ≤ t'': min( sf(t', t' + t), sf(t'', t'' + t) ) ≤ sf(t̄, t̄ + t). This implies that to minimize sbf(t), it suffices to check the start times of all scheduling intervals. Assume first that t̄ falls inside a compatible interval or a memory phase. Then the schedule provides no supply in [t', t̄]. Therefore: sf(t̄, t̄ + t) = sf(t', t̄ + t) ≥ sf(t', t' + t). Now assume that t̄ falls in an execution phase or in an idle interval before the start time t'' of the next scheduling interval. Then sf(t̄, t'') = t'' − t̄. Therefore: sf(t̄, t̄ + t) = (t'' − t̄) + sf(t'', t'' + t − (t'' − t̄)) ≥ sf(t'', t'' + t). To conclude the proof, it suffices to note that since the schedule is periodic, if t̄ ≥ 2h, then sf(t̄, t̄ + t) = sf(t_k, t_k + t) where t_k = (t̄ mod h) + h.

Unfortunately, the proposed sbf(t) derivation can only be applied to strictly periodic tasks. If Γ includes any sporadic task τ_i with minimum interarrival time p_i, the schedule cannot be computed offline. Therefore, we now propose an alternative analysis that is independent of the CPU scheduling algorithm and computes a lower bound sbf^L(t) to sbf(t). Note that since sbf(t) is the minimum supply in any interval t, using Equation 2 with sbf^L(t) instead of sbf(t) will still result in a sufficient schedulability test. Let sbf̄(t) be the maximum amount of time in which the CPU is executing either a compatible interval or the memory phase of a predictable interval in any time window of length t. Then by definition, sbf̄(t) = t − sbf(t). Our analysis derives an upper bound sbf̄^U(t) to sbf̄(t). Therefore, sbf^L(t) = t − sbf̄^U(t) is a valid lower bound for sbf(t).

For a given interval size t, we compute sbf̄^U(t) using the following key idea. For each task τ_i, we determine the minimum and maximum amount of time E^min_i(t), E^max_i(t) that the task can execute in a time window of length t while meeting its deadline constraint. Let t_i, with E^min_i(t) ≤ t_i ≤ E^max_i(t), be the amount of time that τ_i actually executes in the time window. Since the CPU is a single shared resource, it must hold that Σ_{i=1}^N t_i ≤ t independently of the task scheduling algorithm. Finally, for each task τ_i, we define the memory bound function mbf_i(t_i) to be the maximum amount of time

that the task can spend in compatible intervals and memory phases, assuming that it executes for t_i time units in the time window. We can then obtain sbf̄^U(t) as the maximum over all feasible {t_1, ..., t_N} tuples of Σ_{i=1}^N mbf_i(t_i). In other words, we can compute sbf̄^U(t) by solving the following optimization problem:

sbf̄^U(t) = max Σ_{i=1}^N mbf_i(t_i),   (4)
subject to Σ_{i=1}^N t_i ≤ t,   (5)
and ∀i, 1 ≤ i ≤ N: E^min_i(t) ≤ t_i ≤ E^max_i(t).   (6)

E^max_i(t) and E^min_i(t) can be computed according to the scenarios shown in Figures 5 and 6, where the notation (x)⁺ is used for max(x, 0).

Figure 5: Derivation of E^max_i(t), periodic or sporadic task.

Figure 6: Derivation of E^min_i(t), periodic task.

Theorem 3. For a periodic or sporadic task:

E^max_i(t) = min(t, e_i) + ⌊(t − e_i − (p_i − D_i)) / p_i⌋⁺ · e_i + (min((t − e_i − (p_i − D_i)) mod p_i, e_i))⁺.   (7)

For a sporadic task, E^min_i(t) = 0, while for a periodic task:

E^min_i(t) = ⌊(t − (p_i + D_i − 2e_i)) / p_i⌋⁺ · e_i + (min((t − (p_i + D_i − 2e_i)) mod p_i, e_i))⁺.   (8)

Proof. As shown in Figure 5, the relative distance between successive executions of a task τ_i is minimized when the task finishes at its deadline in the first period within the time window of length t, and as soon as possible (e.g., e_i time units after its activation) in each successive period. Equation 7 directly follows from this activation pattern, assuming that the time window coincides with the start time of the first job of τ_i. In particular: the term min(t, e_i) represents the execution time of the first job, which is activated at the beginning of the time window; the term ⌊(t − e_i − (p_i − D_i)) / p_i⌋⁺ · e_i represents the execution time of jobs in periods that are fully contained within the time window (note that the first such period starts e_i + (p_i − D_i) time units after the beginning of the time window); finally, the term (min((t − e_i − (p_i − D_i)) mod p_i, e_i))⁺ represents the execution time of the last job in the time window. Similarly, as shown in Figure 6, the relative distance between successive executions of a periodic task τ_i is maximized when the task finishes as soon as possible in its first period and as late as possible in all successive periods. Equation 8 follows assuming that the time window coincides with the finishing time of the first job of τ_i, noticing that periodic task activations start (p_i − e_i) + (D_i − e_i) = p_i + D_i − 2e_i time units after the beginning of the time window. Finally, E^min_i(t) is zero for a sporadic task since by definition the task has no maximum interarrival time.

Lemma 4. The optimization problem of Equations 4-6 admits a solution if task set Γ is feasible.

Proof. The optimization problem admits a solution if the set of t_i values that satisfy Equations 5 and 6 is not empty. Note that by definition E^max_i(t) ≥ E^min_i(t), therefore there is at least one admissible solution iff Σ_{i=1}^N E^min_i(t) ≤ t. Define E^min_{U,i}(t) = (t − (D_i − e_i))⁺ · e_i / p_i. Then it is easy to see that ∀t, E^min_i(t) ≤ E^min_{U,i}(t): in particular, both functions are equal to e_i for t = p_i + D_i − e_i and increase by e_i every p_i afterwards. Now note that E^min_{U,i}(t) ≤ t · e_i / p_i, therefore it also holds: Σ_{i=1}^N E^min_i(t) ≤ t · Σ_{i=1}^N e_i / p_i. The proof follows by noticing that since Γ is feasible, it must hold Σ_{i=1}^N e_i / p_i ≤ 1.
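For reference, the bound functions of Theorem 3 can be evaluated directly. The sketch below is an illustrative C rendering of Equations (7) and (8), with the (x)⁺ operator handled by an explicit guard; it is not part of the analysis tool.

    #include <math.h>

    /* E^max_i(t) of Equation (7); e, p, D stand for e_i, p_i, D_i. */
    double E_max(double t, double e, double p, double D)
    {
        double x = t - e - (p - D);
        if (x <= 0.0)
            return fmin(t, e);            /* only the first job fits */
        return fmin(t, e)                 /* first job */
             + floor(x / p) * e           /* fully contained periods */
             + fmin(fmod(x, p), e);       /* last, partial job */
    }

    /* E^min_i(t) of Equation (8); for a sporadic task E^min_i(t) = 0. */
    double E_min(double t, double e, double p, double D)
    {
        double x = t - (p + D - 2.0 * e);
        return (x <= 0.0) ? 0.0
             : floor(x / p) * e + fmin(fmod(x, p), e);
    }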

mbf_i(t_i) can be computed using a strategy similar to the one in Theorem 2. Since t_i represents the amount of time that τ_i executes, we consider a schedule in which τ_i is executed continuously, being activated every e_i time units, as shown in Figure 7. The start time of the first scheduling interval in the first job of τ_i is t_1 = 0, the start time of the second scheduling interval is t_2 = e_{i,1}, and so on and so forth until the first interval of the second job, which has start time equal to e_i. Then mbf_i(t_i) can be computed as the maximum amount of time that the task spends in a compatible interval or memory phase in any time window of length t_i.

Figure 7: Derivation of mbf_i(t_i) and mbf^U_i(t_i).

Theorem 5. Let mf_i(t', t'') be the amount of time that task τ_i spends in a compatible interval or memory phase in the interval [t', t''], in the schedule in which τ_i is executed continuously. Then:

mbf_i(t_i) = max_{k=1...N_i} mf_i(t_k, t_k + t_i).   (9)

Proof. Note that the schedule in which τ_i executes continuously is periodic with period e_i. The same strategy as in Theorem 2 can be used to show that if t', t'' are consecutive interval start times, then for all t̄ with t' ≤ t̄ ≤ t'': max( mf_i(t', t' + t_i), mf_i(t'', t'' + t_i) ) ≥ mf_i(t̄, t̄ + t_i).

As an example, in Figure 7 mbf_i(t_i) is maximized in the time window that starts at t_4 = 10. Since mbf_i(t_i) is a nonlinear function, computing sbf̄^U(t) according to Equations 4-6 requires solving a nonlinear optimization problem. To simplify the problem, we consider a linear upper bound approximation mbf^U_i(t_i) = α_i t_i + δ_i, as shown in Figure 7, where

α_i = ( Σ_{s_{i,j} compatible} e_{i,j} + Σ_{s_{i,j} predictable} e^mem_{i,j} ) / e_i,   (10)

and δ_i is the minimum value such that ∀t_i, mbf^U_i(t_i) ≥ mbf_i(t_i). Using mbf^U_i(t_i), Equations 4-6 can be efficiently solved in O(N) time. Furthermore, we can show that sbf̄^U(t) can be computed for all relevant values of t by solving the optimization problem of Equations 4-6 in a finite number of points, and using linear interpolation to derive the remaining values. Due to the complexity of the resulting algorithm, the full derivation is provided in Appendix B.

7. Evaluation

In order to verify the validity and practicality of PREM, we implemented the key components of the system. In this section, we describe our evaluation, first introducing the new hardware components, followed by the corresponding software driver and OS calibration effort. We then discuss our compiler implementation and analyze its effectiveness on a DES benchmark. Finally, using synthetic tasks we measure the effectiveness of the PREM system as a function of cache stall time, and show traces of PREM when running on COTS hardware.

7.1. PREM Hardware Components

When introducing PREM in Section 3, two additional hardware components were added to the COTS system (recall Figure 1). Here, we briefly review our previous work, which describes these hardware components in more detail [1], and describe the additions to these components which we made to provide the mechanism for PREM execution. First we describe the real-time bridge component, then we describe the peripheral scheduler component.

Our real-time bridge prototype is implemented on an ML507 FPGA Evaluation Platform, which contains a COTS TEMAC Ethernet hardware block (the I/O peripheral under control), and a PCIe edge connector (the interface to the main system). Interposed between these hardware components, our real-time bridge uses a System-on-Chip design running Linux. The real-time bridge is also wired to the peripheral scheduler, which allows direct communication between these components.
Three wires are used to transmit information between each real-time bridge and the peripheral scheduler: data ready, data block, and interrupt block. The data ready wire is an output signal sent to the peripheral scheduler which is asserted whenever the COTS peripheral has buffered data in the real-time bridge. The peripheral scheduler sends two signals, data block and interrupt block, to the real-time bridge. The data block signal is asserted to block the real-time bridge from transferring data in DMA mode over the PCIe bus from its local buffer to main memory or vice versa. The interrupt block signal is a new signal in the real-time bridge implementation, added to support PREM, and instructs the real-time bridge not to raise further interrupts. In previous work [1], we provide a more detailed description and buffer analysis of

the real-time bridge design, and demonstrate the component's effectiveness and efficiency.

The peripheral scheduler is a unique component in the system, connected directly to each real-time bridge. Unlike our previous work, where the peripherals were scheduled asynchronously with the CPU [1], PREM requires the CPU and peripherals to coordinate their access to main memory. The peripheral scheduler provides a hardware implementation of the child scheduling model used to schedule peripherals, with each peripheral given a sporadic server to schedule its traffic. Meanwhile, the OS real-time scheduler runs the parent scheduling model for CPU tasks to complete the hierarchical PREM scheduling model described in Section 6. The peripheral scheduler is connected to the PCIe bus, and exposes a set of registers accessible from the main CPU. In the configuration register, constant parameters such as the maximum cache write-back time are stored. Writing a value to the yield register indicates that the CPU will not access main memory for the given amount of time, and I/O peripherals should be allowed to read from and write to RAM. The value written to the yield register contains a 14-bit unsigned integer indicating the number of microseconds to permit peripheral traffic with main memory, as described in Section 4. The CPU can also use the yield register to allow interrupts to be raised by peripherals. Another 14-bit unsigned integer indicates the number of microseconds to allow peripheral interrupts, and a 3-bit interrupt mask selects which peripherals should be allowed to raise interrupts. In this way, different CPU tasks can predictably service interrupts from different I/O peripherals.

7.2. Software Evaluation

The software effort required to implement PREM execution involves two aspects: (1) creating the drivers which control the custom PREM hardware discussed in the previous section, and (2) calibrating the OS to eliminate undesired execution interference. We now discuss these, in order.

The two custom hardware components, the real-time bridge and the peripheral scheduler, each require a software driver to be controlled from the main CPU. Additionally, each peripheral requires a driver running on the real-time bridge's CPU to control the COTS peripheral. The driver for the peripheral scheduler is straightforward, mapping the bus addresses corresponding to the exposed registers to user space, where a PREM-compiled process can access them. The driver for each real-time bridge is more difficult, since each unique COTS peripheral requires a unique driver. However, since in our implementation both the main CPU and the real-time bridge's CPU are running Linux, we can reuse existing, thoroughly tested Linux drivers to drastically reduce the driver creation effort [1]. The presence of a real-time bridge is not apparent in user space, and software programs using the COTS peripherals require no modification.

For our experiments, we use an Intel Q6700 CPU with a 975X system controller; we set the CPU frequency to 1 GHz, obtaining a measured memory bandwidth of 1.8 Gbytes/s, to configure the system in line with typical values for embedded systems. We also disable the speculative CPU hardware prefetcher, since it negatively impacts the predictability of any real-time task. The Q6700 has four CPU cores and each pair of cores shares a common level 2 (last level) cache. Each cache is 16-way set associative with a total size of B = 4 Mbytes and a line size of L = 64 bytes. Since we use a PC platform running a COTS Linux operating system, there are many potential sources of timing noise, such as interrupts, kernel threads, and other processes, which must be removed for our measurements to be meaningful.
For this reason, in order to emulate as closely as possible a typical uniprocessor embedded real-time platform, we divided the 4 cores into two partitions. The system partition, running on the first pair of cores, receives all interrupts for non-critical devices (e.g. the keyboard) and runs all the system activities and non-real-time processes (e.g. the shell we use to run the experiments). The real-time partition runs on the second pair of cores. One core in the real-time partition runs our real-time tasks together with the drivers for the real-time bridges and the peripheral scheduler; the other core is not used. Note that the cores of the system partition can still produce a small amount of unscheduled bus and main memory accesses, or raise rare inter-processor interrupts (IPIs) that cannot be easily prevented. However, in our experiments we found these sources of noise to be negligible. Finally, to solve the paging issue detailed in Section 4, we used a large 4 MB page size just for the real-time tasks, using the HugeTLB feature of the Linux kernel for large page support.

7.3. Compiler Evaluation

We built the PREM real-time compiler prototype using the LLVM Compiler Infrastructure [9], targeting the compilation of C code. LLVM was extended by writing self-contained analysis and transformation passes, which were then loaded into the compiler. In the current PREM real-time compiler prototype, we rely on the programmer to partition the task into predictable and compatible intervals. The partitioning is done by putting each predictable interval into its own function. The beginning and end of the scheduling interval correspond to the entry and exit of the function, and the start of the execution phase (the end of the memory access phase) is manually marked by the programmer. We assume that non-local data accessed during a predictable interval exists in continuous memory regions, which can be prefetched by a set of

PREFETCH_DATA(start_address, size) macros that must be placed by the programmer during the memory phase. The implementation of this macro does the actual prefetching of the data into the level 2 cache by prefetching every cache line in the given range with the i386 prefetcht2 instruction. After the memory phase, the programmer adds a STARTEXECUTION(wcet) macro to indicate the beginning of the execution phase. This macro measures the amount of time remaining in the predictable interval using the CPU performance counter, and writes the time remaining to the yield register in the peripheral scheduler.

All remaining operations needed to transform the interval are performed by a new LLVM function pass. The pass iterates over all the functions in a compilation unit. When a function representing a predictable interval is found, the pass performs code transformation. First, our transform inlines called functions using preexisting LLVM inlining functions. This ensures that there are only a single stack frame and a single segment of code that need to be prefetched into the cache. Second, our transform inserts code to read the CPU performance counter at the beginning of the interval and save the current time. Third, it inserts code to prefetch the stack frame and function arguments. Bringing the stack frame into the cache is done by inserting instructions into the program to fetch the stack pointer and frame pointer. Code is then inserted to prefetch the memory between the stack pointer and slightly beyond the frame pointer (to include function arguments) using the prefetcht2 instruction. Fourth, the transform prefetches the code of the function. This is done by transforming the program so that the function is placed within a unique ELF section. We then use a linker script to define variables pointing to the beginning and end of this unique ELF section. The compiler then adds code that prefetches the memory inside the ELF section. Finally, the pass identifies all return instructions inside the predictable interval function and adds a special function epilog before them. The epilog performs interval length enforcement by looping until the performance counter reaches the worst-case cycle count based on the time value saved at the beginning of the interval. It may also enable peripheral interrupts by writing the worst-case interrupt processing time to the peripheral scheduler's yield register.
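The prefetch operation itself amounts to a loop over cache lines. The sketch below is one way it could look (an illustrative stand-in, not the exact prototype code): it assumes the 64-byte line size of our testbed and relies on the GCC/Clang __builtin_prefetch intrinsic, which on x86 is typically lowered to one of the prefetcht0/t1/t2 instructions depending on the locality hint.

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64   /* line size L of the testbed's last-level cache */

    /* Prefetch every cache line overlapping [start, start + size).
     * Illustrative stand-in for the PREFETCH_DATA macro of the prototype. */
    static inline void prefetch_data(const void *start, size_t size)
    {
        const char *p = (const char *)((uintptr_t)start
                                       & ~(uintptr_t)(CACHE_LINE - 1));
        const char *end = (const char *)start + size;

        for (; p < end; p += CACHE_LINE)
            __builtin_prefetch(p, 0, 1);   /* read access, low temporal locality */
    }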
To verify the correctness of the PREM real-time compiler prototype and to test its applicability, we used LLVM to compile a DES cypher benchmark. The DES benchmark was selected because it represents a typical real-time data flow application. The benchmark comprises one scheduling interval which encrypts a variable amount of data. We compiled it as both a predictable and a compatible interval (i.e. with and without prefetching), and measured the number of cache misses with a performance counter. Adapting the interval required no modification to any cypher functions and a total of 11 PREFETCH_DATA macros. Results are shown in Table 1 in terms of the number of cache misses suffered in the execution phase of the predictable interval (after prefetching), and in the entire compatible interval. Data size is in bytes.

Table 1: DES benchmark (cache misses of the compatible interval vs. the execution phase of the predictable interval, for data sizes of 4K, 8K, 32K, 128K, 512K and 1M bytes).

The compatible interval suffers an excessive number of cache misses, which increases roughly proportionally with the amount of processed data. Conversely, the execution phase of the predictable interval has almost zero cache misses, only suffering a small increase when large amounts of data are being processed. The reason the number of cache misses is not zero is that the Q6700 CPU core used in our experiments uses a random cache replacement policy, meaning that with more than one contiguous memory region the probability of self-eviction is non-zero. In all the following experiments, we observed that the number of self-evictions is typically so small that it can be considered negligible.

7.4. WCET Experiments with Synthetic Tasks

In this section, we evaluate the effects of PREM on the execution time of a task. To quickly explore different execution parameters, we developed two synthetic applications. In our linear access application, each scheduling interval operates on a 256-kilobyte global data structure. Data is accessed sequentially, and we vary the amount of computation performed between memory references. The random access application is similar, except that references inside the data structure are nonsequential. For each application, we measured the execution time after compiling the program in two ways: into predictable intervals which prefetch the accessed memory, and into standard, compatible intervals. For each type of compilation, we ran the experiment in two ways, with and without I/O traffic transmitted by an 8-lane PCIe peripheral with a measured throughput of 1.2 Gbytes/s. In the case of compatible intervals, we transmitted traffic during the entire interval to mirror the worst case according to the traditional execution model. Figures 8 and 9 show the observed worst-case execution time for any scheduling interval as a function of the cache stall time of the application, averaged over 10 runs. The cache stall time represents the percentage of time required to fetch cache lines out of an entire compatible interval, assuming a fixed (best-case) fetch time based on the maximum measured main-memory throughput. Only a single line is shown for predictable intervals because experiments confirmed that injecting traffic during the execution phase

does not increase execution time.

Figure 8: random access (worst-case execution time in ms vs. cache stall time %).

Figure 9: linear access (worst-case execution time in ms vs. cache stall time %).

In all cases, the computation time decreases with an increase in stall time. This is because stall time is controlled by varying the amount of computation between memory references. Furthermore, execution times should not be compared between the two figures because the two applications execute different code. In the random access case, predictable intervals outperform compatible intervals (without peripheral traffic) by up to 28%, depending on the cache stall time. We believe this effect is primarily due to the behavior of DRAM main memory. Specifically, accesses to adjacent addresses can be served more quickly in burst mode than accesses to random addresses. Thus, we can decrease the execution time by loading all the accessed memory into cache, in order, at the beginning of each predictable interval. Furthermore, note that transmitting peripheral traffic during a compatible interval can increase execution time by more than 60% in the worst case. In Figure 9, predictable intervals perform worse than compatible intervals (without peripheral traffic). We believe this is mainly due to out-of-order execution in the Q6700 core. In compatible intervals, while the core performs a cache fetch, instructions in the pipeline that do not depend on the fetched data can continue to execute. When performing linear accesses, fetches require less time and this effect is magnified. Furthermore, the gain in execution time for the case with peripheral traffic is decreased: this occurs because bursting data on the memory bus reduces the amount of blocking time suffered by a task due to peripheral interference (this effect has been previously analyzed in detail [15]). In practice, we expect the effect of PREM on an application's execution time to lie between the two figures, based on the specific memory access pattern.

7.5. System-Wide Coscheduling Traces

We now present execution traces of the implemented system which demonstrate the advantage of the PREM coscheduling approach. The traces are obtained by using the peripheral scheduler as a logic analyzer for the various signals which are being sent to or from the real-time bridges: data block, data ready, and interrupt block. Additionally, the peripheral scheduler has a trace register which allows timestamped trace information to be recorded with a one-microsecond resolution when instructed by the main CPU, such as at the start and end of an execution interval. An execution trace is shown for a task running the traditional COTS execution model in Figure 10, and the same task running within the PREM model is shown in Figure 11 (the compatible intervals at the end of T1 and at the start of T2 were measured as 0.107 ms and 0.009 ms, respectively, and have been exaggerated in the figures to be visible).

In the first trace (Figure 10), although the execution is divided into unpreemptable intervals, there are no memory phase prefetches or constant execution time guarantees. When the scheduling intervals of task T1 finish executing, an I/O peripheral begins to access main memory, which may happen if T1 had written output data into RAM. Task T2 then executes, suffering cache misses that compete for main memory bandwidth with the I/O peripheral. Due to the cold cache and peripheral interference, the execution time of T2 grows from 0.5 ms (the execution time with a warm cache and no peripheral I/O) to 2.9 ms as shown in the figure, an increase of about 600%. In the second trace (Figure 11), the system executes according to the PREM execution model, where peripherals only access the bus when permitted by the peripheral scheduler. The predictable interval is divided into a memory phase and an execution phase.
Instead of competing for main memory access, task T2 gets contentionless access to main memory during the memory phase. After all to-be-accessed data is loaded into the cache, the execution phase begins, which incurs no cache misses, and the peripheral is then allowed to access the data in main memory. The constant execution time of the predictable interval in the PREM execution model is 1.6 ms, which is significantly lower than the worst case observed for the unscheduled trace (and is about the same as the execution time of the scheduling interval of T2 in the unscheduled trace with a cold cache and no peripheral traffic). The compatible intervals at the end of T1 and at the start of T2 were measured as 0.107 ms and 0.009 ms, respectively, and have been exaggerated in the figures to be visible.

Figure 10: An unscheduled bus-access trace (without PREM); scheduling intervals of T1, T2 and T1 with I/O memory access, T2's interval measuring 2.9 ms.
Figure 11: A scheduled trace using PREM; compatible intervals, a predictable memory phase and a predictable execution phase totaling 1.6 ms, with the I/O memory access scheduled around them.

8. Conclusions

We have discussed the concept and implementation of a novel task execution model, the PRedictable Execution Model (PREM). Our evaluation shows that by enforcing a high-level co-schedule among CPU tasks and peripherals, PREM can greatly reduce or outright eliminate low-level contention for shared resource access. We plan to further develop our solution in two main directions. First, we will study extensions to our compiler infrastructure to lift some of the more restrictive code assumptions and to compile and test a larger set of benchmarks. Second, it is important to recall that contention for shared resources becomes more severe as the number of active components increases. In particular, worst-case execution time can greatly degrade in multicore systems [16]. Since PREM can make the system contentionless, we predict that the benefits of our approach will become even more significant when applied to multiple processor systems.

References

[1] S. Bak, E. Betti, R. Pellizzoni, M. Caccamo, and L. Sha. Real-time control of I/O COTS peripherals for embedded systems. In Proc. of the 30th IEEE Real-Time Systems Symposium, Washington, DC, Dec.
[2] G. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applications. Kluwer Academic Publishers, Boston.
[3] S. A. Edwards and E. A. Lee. The case for the precision timed (PRET) machine. In DAC '07: Proc. of the 44th Annual Design Automation Conference.
[4] T. Facchinetti, G. Buttazzo, M. Marinoni, and G. Guidi. Non-preemptive interrupt scheduling for safe reuse of legacy drivers in real-time systems. In ECRTS '05: Proc. of the 17th Euromicro Conf. on Real-Time Systems.
[5] H. Falk, P. Lokuciejewski, and H. Theiling. Design of a WCET-aware C compiler. In ESTIMedia '06: Proc. of the 2006 IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia.
[6] C. Ferdinand, F. Martin, and R. Wilhelm. Applying compiler techniques to cache behavior prediction. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems.
[7] D. Grund and J. Reineke. Precise and efficient FIFO-replacement analysis based on static phase detection. In Proceedings of the 22nd Euromicro Conference on Real-Time Systems (ECRTS), Brussels, Belgium, July.
[8] K. Hoyme and K. Driscoll. SAFEbus(tm). IEEE Aerospace and Electronic Systems Magazine, pages 34-39, Mar.
[9] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis and transformation. In Proc. of the International Symposium on Code Generation and Optimization, San Jose, CA, USA, Mar.
[10] M. Lewandowski, M. Stanovich, T. Baker, K. Gopalan, and A. Wang. Modeling device driver effects in real-time schedulability: Study of a network driver. In Proc. of the 13th IEEE Real-Time Application Symposium, Apr.
[11] J. Liedtke, H. Hartig, and M. Hohmuth. OS-controlled cache predictability for real-time systems. In Proceedings of the 3rd IEEE Real-Time Technology and Applications Symposium (RTAS).
[12] F. Mueller. Timing analysis for instruction caches. Real-Time Systems Journal, 18(2/3), May.
[13] M. Paolieri, E. Quinones, F. J. Cazorla, and M. Valero. An analyzable memory controller for hard real-time CMPs. IEEE Embedded Systems Letters, 1(4), Dec.
[14] PCI SIG. Conventional PCI 3.0, PCI-X 2.0 and PCI-E 2.0 Specifications.
[15] R. Pellizzoni and M. Caccamo. Impact of peripheral-processor interference on WCET analysis of real-time embedded systems. IEEE Trans. on Computers, 59(3), Mar. 2010.
[16] R. Pellizzoni, A. Schranzhofer, J.-J. Chen, M. Caccamo, and L. Thiele. Worst case delay analysis for memory interference in multicore systems. In Proceedings of Design, Automation and Test in Europe (DATE), Dresden, Germany, Mar.
[17] I. Puaut and D. Hardy. Predictable paging in real-time systems: A compiler approach. In ECRTS '07: Proc. of the 19th Euromicro Conf. on Real-Time Systems.
[18] H. Ramaprasad and F. Mueller. Bounding preemption delay within data cache reference patterns for real-time tasks. In Proc. of the IEEE Real-Time Embedded Technology and Applications Symposium, Apr.
[19] J. Reineke, D. Grund, C. Berg, and R. Wilhelm. Timing predictability of cache replacement policies. Real-Time Systems, 37(2), 2007.
[20] S. Schliecker, M. Negrean, G. Nicolescu, P. Paulin, and R. Ernst. Reliable performance analysis of a multicore multithreaded system-on-chip. In CODES/ISSS.
[21] I. Shin and I. Lee. Periodic resource model for compositional real-time guarantees. In Proceedings of the 23rd IEEE Real-Time Systems Symposium, Cancun, Mexico, Dec.
[22] J. Whitham and N. Audsley. Implementing time-predictable load and store operations. In Proc. of the Intl. Conf. on Embedded Software (EMSOFT), Grenoble, France, Oct.

A. Analysis of Cache Replacement Policy

We first provide a brief overview of each analyzed policy. Under the random policy, whenever a new cache line is loaded into an associative set, the line to be evicted is chosen at random among all N lines in the set. While this policy is implemented in some modern COTS CPUs with very large cache associativity, such as the Intel Core architecture, it is clearly ill suited to real-time systems. In the FIFO policy, a FIFO queue is associated with each associative set, as shown in Figure 12(a). Each newly fetched cache line is inserted at the head of the queue (position q_1), while the cache line which was previously at the back of the queue (position q_N) is evicted. In the figure, l_1, ..., l_4 represent cache lines used in the predictable interval, while dashes represent other cache lines available in cache but unrelated to the predictable interval. Finally, grayed boxes represent invalidated cache lines.

Figure 12: FIFO policy, example replacement queue (positions q_1, ..., q_8; the mem. phase, fetch and inval. labels show the queue after the memory phase and after lines are fetched or invalidated outside the predictable interval).

The Least Recently Used (LRU) policy also uses a replacement queue, but a cache line is moved to the head of the queue whenever it is accessed. Due to the complexity of implementing LRU, an approximated version of the algorithm known as pseudo-LRU can be employed when N is a power of 2. A binary replacement tree with N-1 internal nodes and N leaves q_1, ..., q_N is constructed as shown in Figure 13 for N = 4. Each internal node encodes a right or left direction using a single bit of data, while each leaf encodes a fetched cache line. Whenever a replacement decision must be taken, the encoded directions are followed starting from the root to the leaf representing the cache line to be replaced. Furthermore, whenever a cache line is accessed, all directions on the path from the root to the leaf of the accessed cache line are set to the opposite of the followed path. Figure 13 shows an example where four cache lines are accessed in the order l_1, l_2, l_3, l_2, l_4, causing a self-eviction; a minimal sketch of this tree mechanism is given below.
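To make the tree-based mechanism concrete, the following is a minimal sketch of pseudo-LRU victim selection and update for a single set with N = 4. The data layout, the bit encoding (0 = left, 1 = right), and the function names are our own assumptions, not code from the paper or from any specific CPU.

```c
#include <stdio.h>

#define N 4                         /* associativity (power of 2)        */
#define NODES (N - 1)               /* internal nodes of the binary tree */

/* One associative set: N ways plus N-1 tree bits.
 * tree[i] == 0 means the victim lies in the left subtree of node i,
 * tree[i] == 1 means it lies in the right subtree.                      */
struct plru_set {
    int way[N];                     /* cached line id, or -1 if the way holds no relevant line */
    int tree[NODES];
};

/* Follow the encoded directions from the root to the way currently
 * designated for replacement.                                           */
static int plru_victim(const struct plru_set *s)
{
    int node = 0;
    while (node < NODES)
        node = 2 * node + 1 + s->tree[node];    /* 0 -> left child, 1 -> right child */
    return node - NODES;                        /* leaf index -> way index           */
}

/* After an access, flip every node on the root-to-leaf path so that it
 * points away from the just-used way.                                   */
static void plru_touch(struct plru_set *s, int way)
{
    int node = way + NODES;
    while (node > 0) {
        int parent = (node - 1) / 2;
        s->tree[parent] = (node == 2 * parent + 1);  /* came from the left -> point right */
        node = parent;
    }
}

/* Access line 'id': hit if present, otherwise evict the designated victim. */
static void plru_access(struct plru_set *s, int id)
{
    for (int w = 0; w < N; w++)
        if (s->way[w] == id) { plru_touch(s, w); return; }
    int v = plru_victim(s);
    printf("fetch l%d, evict way %d (old id %d)\n", id, v, s->way[v]);
    s->way[v] = id;
    plru_touch(s, v);
}

int main(void)
{
    struct plru_set s = { .way = { -1, -1, -1, -1 }, .tree = { 0, 0, 0 } };
    const int seq[] = { 1, 2, 3, 2, 4 };        /* access order from the Figure 13 example */
    for (unsigned i = 0; i < sizeof seq / sizeof seq[0]; i++)
        plru_access(&s, seq[i]);
    return 0;
}
```

Starting from an all-zero tree and ways holding no relevant line, feeding this sketch the access order l_1, l_2, l_3, l_2, l_4 ends with l_4 evicting l_1, reproducing the kind of self-eviction discussed for Figure 13.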
Without loss of generality, in all proofs in this section we focus on the analysis of a single associative set with Q cache lines accessed during the predictable interval. Furthermore, let l_1, ..., l_Q be the set of Q cache lines in the order in which they are first accessed during the memory phase (note that this order can differ from the order in which the lines are prefetched, due to lines that are accessed multiple times). The results for LRU and random are simple and well known (see [6], for example); we detail the bound derivation in the following theorem to provide a better understanding of the replacement policies.

Theorem 6. A memory phase will not suffer any cache self-eviction in an associative set requiring Q entries if Q is at most equal to 1 for the random replacement policy, and at most equal to N for the LRU replacement policy.

Proof. Clearly no self-eviction is possible if Q = 1, since after the unique cache line is fetched no other cache line can be accessed in the associative set. Now consider the LRU policy. When a cache line l_i is first accessed, it is fetched (if it is not already in cache) and moved to the beginning of the replacement queue. Since no cache line outside of l_1, ..., l_Q is accessed in the associative set, when l_i is first accessed, cache lines l_1, ..., l_{i-1} must be at the head of the queue in some order. But Q ≤ N implies i-1 ≤ N-1, hence the algorithm always picks an unrelated line at the back of the queue to be replaced.
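As a small illustration of the LRU argument (our own check, not part of the paper's evaluation), the sketch below simulates a single LRU replacement queue that initially holds only unrelated lines and performs the first access to each of Q ≤ N relevant lines in order; the assertion confirms that the evicted line is always an unrelated one. Re-accesses of already-fetched relevant lines would only move them toward the head and cannot change this outcome.

```c
#include <assert.h>
#include <stdio.h>

#define N 8     /* associativity */
#define Q 8     /* relevant lines accessed in the predictable interval (Q <= N) */

int main(void)
{
    /* queue[0] is the head (most recently used), queue[N-1] the back (LRU).
     * Negative ids stand for unrelated lines already present in the set.   */
    int queue[N];
    for (int i = 0; i < N; i++)
        queue[i] = -(i + 1);

    for (int id = 1; id <= Q; id++) {        /* first access to l_1, ..., l_Q   */
        int evicted = queue[N - 1];          /* LRU evicts the line at the back */
        assert(evicted < 0);                 /* never one of l_1, ..., l_Q      */
        for (int i = N - 1; i > 0; i--)      /* shift down, insert at the head  */
            queue[i] = queue[i - 1];
        queue[0] = id;
    }
    printf("no self-eviction with Q = %d, N = %d\n", Q, N);
    return 0;
}
```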

Figure 13: Pseudo-LRU policy, full-invalidation; replacement tree and positions q_1, ..., q_4 as the cache lines are accessed in the order l_1, l_2, l_3, l_2, l_4.
Figure 14: Pseudo-LRU policy, partial-invalidation; positions q_1, ..., q_4 as l_1, l_2, l_3 are accessed and l_4 is then fetched.

Note that for random the bound of 1 is tight, since in the worst case fetching a second cache line in the associative set can cause the first cache line to be evicted. Furthermore, note that if the cache is not invalidated between successive activations of the same predictable interval, then some of the l_1, ..., l_Q cache lines can still be available in cache. Assume that at the beginning of the predictable interval l_2 is available but not l_1. Then even with the LRU policy, prefetching l_1 can potentially cause l_2 to be evicted. According to our definition, this is not a self-eviction: l_2 is evicted before it is accessed during the memory phase, and is then reloaded when it is first accessed, thus guaranteeing that all required lines are available in cache at the end of the memory phase. In fact, Theorem 6 proves that the bounds for random and LRU are independent of the state of the cache before the beginning of the predictable interval.

Unfortunately, the same is not true of FIFO and pseudo-LRU. Consider the FIFO policy. If a cache line l_i is already available in cache, then when l_i is accessed during the memory phase no modification is performed on the replacement queue. In some situations this can cause l_i to be evicted by another line fetched after l_i in the memory phase, causing a self-eviction. Due to this added complexity, when analyzing FIFO and pseudo-LRU we need to consider the invalidation model. A detailed analysis of FIFO replacement is provided in [7]; here we summarize the results and intuitions relevant to our system.

Theorem 7 (directly follows from Lemma 1 in [7]). Assume FIFO replacement with full-invalidation semantics. Then no self-eviction is possible for an associative set with Q ≤ N.

The analysis for partial-invalidation FIFO is more complex. Figure 12 (analogous to Figure 2 in [7]) depicts a clarifying example, where the mem. phase, fetch and inval. labels show the state of the replacement queue after the memory phase and after some cache lines have been either fetched or invalidated outside the predictable interval, respectively. Note that after Step (e), lines l_1 and l_2 remain in cache but not l_3 and l_4. Hence, when l_3 and l_4 are fetched during the next memory phase in Step (f), they evict l_1 and l_2. The solution is to prefetch l_1, ..., l_Q multiple times in the memory phase, each time in the same order.

Theorem 8 (directly follows from Theorem 3 in [7]). Assume FIFO replacement with partial-invalidation semantics. If the Q cache lines in an associative set, with Q ≤ N, are prefetched at least Q times in the same access order l_1, ..., l_Q, then all Q cache lines are available in cache at the end of the memory phase. Furthermore, no more than Q fetches are required.

Note that while executing the prefetching code Q times might seem onerous, in practice the majority of the time in the memory phase is spent fetching and writing back cache lines, and Theorem 8 ensures that no more than Q fetches and write backs are required for each associative set; a sketch of this repeated, same-order prefetch is given below.
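Concretely, a PREM memory phase targeting a FIFO cache with partial invalidation could apply Theorem 8 by repeating its prefetch pass over the same addresses in the same order. The sketch below is our own illustration of that repeated traversal; the function name and calling convention are assumptions and do not reflect the code actually emitted by the modified LLVM compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* line_addr holds one address per cache line to be prefetched, listed in a
 * fixed order; for a FIFO cache with partial invalidation, Theorem 8 suggests
 * repeating the pass as many times as the largest number of listed lines
 * mapping to a single associative set (at most the associativity N).        */
static void memory_phase_prefetch(const uintptr_t *line_addr, size_t nlines,
                                  unsigned passes)
{
    for (unsigned pass = 0; pass < passes; pass++)
        for (size_t i = 0; i < nlines; i++)
            (void)*(volatile const char *)line_addr[i];  /* touch one byte of the line */
}
```

Despite the repetition, Theorem 8 guarantees that no more than Q fetches and write backs occur per associative set, since lines already resident after an earlier pass simply hit on later passes.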
While LRU and FIFO allow prefetching up to N cache lines with N fetches, the same is not true of pseudo-LRU, even in the full-invalidation model. Consider again Figure 13, where no relevant lines are available in cache at the beginning of the memory phase, but line l_2 is accessed multiple times. Then it is easy to see that no more than 3 cache lines can be prefetched without causing a self-eviction. The key intuition is that up to Step (d), and with respect to l_1, l_2, l_3, the replacement tree encodes the LRU order exactly. However, after l_2 is accessed again at Step (e), the LRU order is lost, and the next replacement causes l_1 to be evicted instead of the remaining unrelated cache line. Furthermore, note that the strategy employed in Theorem 8 does not work in this case, because l_1 has already been fetched during the memory interval before being evicted. A similar situation can happen even if no line is accessed multiple times in the partial-invalidation model, as shown in Figure 14. Assume that, due to partial replacements and cache accesses, at the beginning of the memory phase the cache state is as shown in Figure 14(a), with l_1, l_2 and l_3 available in cache but not l_4. Then after the first three cache lines are prefetched in order, l_1 will be replaced when l_4 is next prefetched. A lower bound that is independent of Q and of the invalidation model was first proven in [19].
