Multiprocessor Real-Time Systems with Shared Resources: Utilization Bound and Mapping


CS-TR-2013-002, Department of Computer Science, UTSA, 2013

Jian-Jun Han, Member, IEEE, Dakai Zhu, Member, IEEE, Xiaodong Wu, Laurence T. Yang, Senior Member, IEEE, Hai Jin, Senior Member, IEEE

Abstract

In real-time systems, both scheduling theory and resource access protocols have been studied extensively. However, there is very limited research focusing on scheduling algorithms specifically for real-time systems with shared resources, where the problem becomes more prominent with the emergence of multicore processors. In this paper, focusing on partitioned-EDF scheduling with the MSRP resource access protocol, we study the utilization bound and efficient task mapping schemes for a set of periodic real-time tasks running on a multicore/multiprocessor system with shared resources. Specifically, we first illustrate the scheduling anomaly where the tasks are schedulable when being mapped on to fewer but not more processors due to synchronization overhead. Then, with such synchronization overhead being considered, we develop a synchronization-cognizant utilization bound. Moreover, we show that finding the optimal mapping of tasks with shared resources is NP-hard and propose two efficient synchronization-cognizant task mapping algorithms (SC-TMA) that rely on the new tightened synchronization overhead and have the goal of achieving better schedulability and balanced workload on deployed processors. Finally, the proposed SC-TMA schemes are evaluated through extensive simulations with synthetic tasks. The results show that the schedulability ratio and (average) system load under SC-TMA are close to those of an INLP (Integer Non-Linear Programming) based solution for small task systems. When compared to the existing task mapping algorithms, SC-TMA obtains a much better schedulability ratio and lower/balanced workload on all processors.
Index Terms: Real-time systems; Multiprocessor; Periodic tasks; Shared resources; Partitioned scheduling; Utilization bound

J.-J. Han, X. Wu, L. T. Yang and H. Jin are with the School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China. D. Zhu is with the Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX 78249, USA.

1 INTRODUCTION

The scheduling theory for real-time systems has been studied for decades and many scheduling algorithms have been developed. Although the uniprocessor scheduling theory has been comprehensively studied, where earliest deadline first (EDF) and rate monotonic scheduling (RMS) are the well-known optimal scheduling algorithms [30], the scheduling of real-time tasks in multiprocessor systems is still an evolving research field and many problems remain open due to their intrinsic difficulties [15], [40]. Moreover, with the emergence of multicore processors, there is a reviving interest in the multiprocessor real-time scheduling problem and many results have been reported recently [8], [14], [22], [23], [29].

There have been two traditional approaches to the multiprocessor scheduling problem: partitioned and global scheduling [16], [17]. In partitioned scheduling, tasks are statically assigned to processors and a task can only run on its designated processor, where the well-established uniprocessor scheduling algorithms (e.g., EDF and RMS) can be employed on each processor. In comparison, all tasks in global scheduling are put into a shared queue and every idle processor fetches the next highest-priority ready task from the global queue for execution. More recently, as a general and hierarchical approach, cluster scheduling has been investigated, where tasks are first partitioned among several clusters of processors and then scheduled with different global scheduling policies (e.g., Global-EDF) within each cluster [7], [37], [42].
To efficiently determine whether a given task set is schedulable, several utilization bounds have been developed for partitioned scheduling based on the simple task mapping heuristics (such as first-fit, best-fit and worst-fit) when both EDF [32] and RMS [31], [35] are considered on each processor. Similarly, by limiting the maximum task utilization, such utilization bounds have also been studied for global-EDF and global-RMS scheduling algorithms [2], [6], [21]. Note that these utilization bounds can only be applied to task systems that do not have shared resources.

In some real-time systems, due to resource limitation and/or synchronization requirements, tasks may need to exclusively access shared resources (such as shared data objects or I/O channels). Such resource access contention can lead to significantly degraded schedulability due to priority inversion, where a lower-priority task accesses a shared resource non-preemptively and thus blocks the execution of a higher-priority task [38]. For instance, a recent study pointed out that an application could take up to 30% more time due to access contention for shared data resources on a 16-processor system [10]. As multicore processors, where multiple processing cores are integrated on a single chip with shared last level cache and I/O channels [36], emerge to be the computing engine for modern real-time embedded systems [12], [33], such resource access contention problems will become more prominent and demand efficient scheduling solutions.

To tackle the priority inversion problem, several lock-based resource access protocols have been investigated, such as Priority Ceiling Protocol (PCP) [41] and Stack Resource Policy (SRP) [5] for uniprocessor systems and the corresponding extensions MPCP [28] and MSRP [20] for multiprocessor
systems. These protocols have been exploited to guarantee the timeliness of tasks when accessing shared resources.

However, most existing work focused on either scheduling algorithms or resource access protocols. There is only limited research on scheduling algorithms for tasks with shared resources that take resource access protocols into consideration to improve task schedulability [24], [25], [34]. In [25], focusing on fixed priority (i.e., RMS) scheduling, Hsiu et al. developed a dedicated-core framework where several service tasks running on a few dedicated cores control the access right for shared resources in critical sections using an RPC-like mechanism and application tasks are mapped to processors with simple heuristics (e.g., First-Fit). Considering the RMS scheduling and MPCP, Nemati et al. studied a best-fit decreasing (BFD) task mapping heuristic based on macrotasks (which contain tasks that directly or indirectly access the same resources) [34]. The scheme tries to map tasks in a macrotask to a processor to reduce synchronization overhead. Moreover, focusing on partitioned-EDF, Han et al. very recently proposed a synchronization-aware worst-fit decreasing (SA-WFD) task mapping scheme that tries to allocate tasks accessing a similar set of resources to a processor for better schedulability [24].

However, the existing work either focused on simple heuristics that do not directly take synchronization overhead into consideration [25] or adopted very pessimistic estimation for such overhead [24], [34] when making scheduling/mapping decisions. Moreover, to the best of our knowledge, there is no existing work that studied the utilization bound for real-time tasks that access shared resources in multiprocessor systems.

In this paper, focusing on partitioned-EDF and MSRP, we study the utilization bound and efficient task mapping schemes for real-time tasks running on a multiprocessor system with shared resources under the MSRP resource access protocol. The contributions of this work can be summarized as follows.
- We discover and formally show the scheduling anomaly problem where a set of tasks are schedulable when being mapped on to fewer but not more processors due to synchronization overhead for accessing shared resources;
- We develop the first synchronization-cognizant utilization bound for partitioned-EDF scheduling that takes the synchronization overhead of tasks accessing shared resources into consideration;
- We show that the mapping problem of tasks with shared resources is NP-hard and propose two synchronization-cognizant task mapping heuristics aiming at reducing synchronization overhead and obtaining better schedulability and balanced workload on deployed processors.

The remainder of this paper is organized as follows. Section 2 reviews closely-related research. Section 3 presents system models and some preliminaries. The utilization bound is developed in Section 4. Section 5 explores a new method to tighten overhead and Section 6 presents the synchronization-cognizant task mapping algorithms. Simulation results are discussed in Section 7 and Section 8 concludes the paper.

2 RELATED WORK

For a set of periodic real-time tasks running on uniprocessor systems, the optimal static/fixed priority based RMS scheduling algorithm has a utilization bound of $N(2^{1/N} - 1)$, where $N$ is the number of tasks under consideration, and the dynamic priority based EDF scheduler can achieve 100% system utilization [30]. However, since a real-time task cannot occupy more than one processor at any time, the global-scheduling based EDF and RMS algorithms could fail to schedule task sets with very low system utilization in multiprocessor systems [17]. When the maximum task utilization for a task set is limited and no more than $\alpha$ ($\le 1$), it has been shown that a task set is schedulable under global-EDF if the system utilization does not exceed $M(1 - \alpha) + \alpha$, where $M$ is the number of processors in the system [21]. Similarly, for global-RMS scheduling, a system utilization of $(M/2)(1 - \alpha) + \alpha$ can be guaranteed [6]. Andersson et al.
also studied the scheduling algorithm RM-US, where tasks with utilization higher than some threshold $\theta$ have the highest priority [2]. When $\theta = 1/3$, Baker showed that RM-US can guarantee a system utilization of $(M + 1)/3$ [6].

With the goal of achieving 100% system utilization, several optimal global-scheduling based algorithms have been studied. For instance, the well-known proportional fair (Pfair) scheduler enforces proportional progress (i.e., fairness) for each task at every time quantum [9]. Later, observing that a periodic real-time task can only miss its deadline at its period boundary, the boundary-fair (Bfair) scheduling algorithm was studied that makes scheduling decisions and ensures fairness for tasks only at the period boundaries to reduce scheduling overhead [43]. For continuous-time based systems, the T-L plane based scheduling algorithms were studied in [13], [19] and a generalized deadline-partitioned fair (DP-Fair) scheduling model was investigated in [29]. More recently, an optimal approach RUN was studied that reduces the problem to a uniprocessor scheduling problem [39]. Although these optimal global schedulers can achieve full system utilization, all of them could incur quite high scheduling overhead due to an excessive number of scheduling points and context switches.

For partitioned scheduling, Oh and Baker studied the rate-monotonic first-fit (RMFF) heuristic and showed that the utilization bound for RMFF on an $M$-processor system is $M(2^{1/2} - 1)$ [35]. Later, a better bound of $(M + 1)(2^{1/(M+1)} - 1)$ for RMFF was derived in [31]. In [3], Andersson and Jonsson proved that the system utilization bound can reach 50% for partitioned RM scheduling by exploiting the harmonicity of tasks' periods. For partitioned-EDF with reasonable mapping heuristics, Lopez et al. showed that any task set can be successfully scheduled if the total utilization is no more than $(\beta M + 1)/(\beta + 1)$, where $\beta = \lfloor 1/\alpha \rfloor$ and $\alpha$ is the maximum task utilization [32]. Andersson et al.
proposed the EKG algorithm based on the concept of portion tasks [4], which can trade higher scheduling overhead for better system utilization (within the range of 66% to 100%). Following the same line of research, several semi-partitioning based scheduling algorithms have been proposed that have different handling mechanisms for portion tasks and thus achieve different system utilization bounds [23], [26], [27].
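The closed-form bounds quoted in this section are easy to compare numerically. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the floor in `beta` follows the form of the partitioned-EDF bound in [32] as we understand it.

```python
import math


def rms_uniprocessor_bound(n: int) -> float:
    """Uniprocessor RMS bound N(2^(1/N) - 1) for n tasks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * (2.0 ** (1.0 / n) - 1.0)


def global_edf_bound(m: int, alpha: float) -> float:
    """Global-EDF bound M(1 - alpha) + alpha, alpha = max task utilization."""
    return m * (1.0 - alpha) + alpha


def global_rms_bound(m: int, alpha: float) -> float:
    """Global-RMS bound (M/2)(1 - alpha) + alpha."""
    return (m / 2.0) * (1.0 - alpha) + alpha


def rmff_bound(m: int) -> float:
    """Improved RMFF bound (M + 1)(2^(1/(M+1)) - 1)."""
    return (m + 1) * (2.0 ** (1.0 / (m + 1)) - 1.0)


def partitioned_edf_bound(m: int, alpha: float) -> float:
    """Partitioned-EDF bound (beta*M + 1)/(beta + 1), beta = floor(1/alpha)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    beta = math.floor(1.0 / alpha)
    return (beta * m + 1) / (beta + 1)


if __name__ == "__main__":
    # Compare the bounds for a 4-processor system with alpha = 0.5.
    for name, value in [
        ("global-EDF", global_edf_bound(4, 0.5)),
        ("global-RMS", global_rms_bound(4, 0.5)),
        ("RMFF", rmff_bound(4)),
        ("partitioned-EDF", partitioned_edf_bound(4, 0.5)),
    ]:
        print(f"{name}: {value:.3f}")
```

None of these bounds accounts for shared resources, which is the gap the rest of the paper addresses.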

Note that all the above studies are for real-time tasks that do not have shared resources. As mentioned earlier, several resource access protocols have been studied to control the access of shared resources among tasks and thus guarantee their timeliness. Examples include the lock-based Priority Ceiling Protocol (PCP) [41] and Stack Resource Policy (SRP) [5] for uniprocessor systems. Such protocols have been extended for multiprocessor systems to tackle the blocking caused by tasks on different processors that access the same shared resources, such as MPCP (Multiprocessor PCP) [28] and MSRP (Multiprocessor SRP) [20]. Moreover, in [18], Easwaran et al. studied a parallel PCP (P-PCP) resource-sharing protocol for fixed-priority tasks in multiprocessor systems and developed the corresponding schedulability conditions. Recently, Block et al. introduced a Flexible Multiprocessor Locking Protocol (FMLP) [10] and Brandenburg et al. studied an asymptotically optimal locking protocol (OMLP) [11].

To improve the schedulability of real-time tasks that access shared resources in multiprocessor systems, a few research studies have focused on the partitioning heuristics with the resource access protocols being considered. In [25], Hsiu et al. developed a dedicated-core framework that delegates the control of resource access to some fixed cores in a multicore real-time system, where service tasks govern the resource access in critical sections using an RPC-like mechanism. The partitioning of application tasks to cores is transformed to three core-minimization problems (i.e., minimizing the number of the service cores, application cores and total cores) under the time and synchronization constraints. They employed several simple heuristics (e.g., First-Fit) for task assignments and tasks on each core are scheduled with RMS [25].

The most related previous studies were recently reported in [24], [34]. Focusing on RMS scheduling on each processor and MPCP, Nemati et al.
studied a cluster-based approach to allocate tasks that share resources in a multicore system based on the best-fit decreasing (BFD) heuristic [34]. They tried to fit a macrotask (which consists of tasks that directly or indirectly access the same resources) in a core. Once a macrotask cannot fit into a core, it is broken and its component tasks will be assigned together with other tasks using the pre-defined weight and attraction functions that measure the degree of resource access contention subject to MPCP. In [24], based on EDF and MSRP, Han et al. proposed a synchronization-aware worst-fit decreasing (SA-WFD) task partitioning heuristic that tries to allocate tasks accessing a similar set of resources to a single core for better schedulability.

The work reported in this paper is different from the existing work, which either did not directly take the synchronization overhead into consideration [25] or adopted simple synchronization characteristics (such as resource similarity) and very pessimistic estimation of such overhead when making scheduling/mapping decisions [24], [34]. Observing that the synchronization overhead has a great impact on task schedulability, we exploit an iterative approach to tighten such overhead during the task mapping process, which can significantly improve task schedulability as shown in the evaluation results. Moreover, we develop the first utilization bound for real-time tasks in multiprocessor systems with shared resources.

3 PRELIMINARIES

In this section, we first lay out the scope of this study by presenting the system, task and resource models and stating our assumptions. Moreover, the MSRP resource access protocol and the schedulability condition for partitioned-EDF [5], [20] are briefly reviewed, followed by the description of the problem to be addressed in this paper.

3.1 System Models

We consider a homogeneous multiprocessor real-time system that consists of $M$ processors ($\{P_1, \ldots, P_M\}$) with identical functions and capabilities.
The system has a set of $R$ global resources $\mathcal{R} = \{R_1, \ldots, R_R\}$, which can be shared by all tasks. A set of $N$ periodic real-time tasks $\Psi = \{\tau_1, \ldots, \tau_N\}$ will be executed on the system and each task $\tau_i$ has a period $p_i$, which is also its relative deadline. That is, the $j$-th task instance (or job) of task $\tau_i$ arrives at time $(j - 1) \cdot p_i$ and has to complete its execution by its absolute deadline $j \cdot p_i$. Note that there is at most one active job for a task at any time. Thus, without causing ambiguity, we use task and job interchangeably for the remainder of this paper.

A resource can be accessed by only one task at any time within its critical sections. That is, the access to any resource by a task is exclusive and non-preemptable. Moreover, we assume that there is no nested critical section as it occurs infrequently in practice and can be tackled by group locks [10]. Therefore, a task is not allowed to request more than one resource (and thus can only access one resource) at any time.

We assume that there are $n_i$ sections in task $\tau_i$ and the $j$-th section ($1 \le j \le n_i$) is denoted as $s_{i,j}$. The worst case execution time (WCET) of section $s_{i,j}$ is $c_{i,j}$ and $\tau_i$'s WCET is given as $c_i = \sum_{j=1}^{n_i} c_{i,j}$. The utilization of task $\tau_i$ is defined as $u_i = c_i / p_i$. The system utilization is given as $U = \sum_{i=1}^{N} u_i$.

Moreover, to precisely model the critical sections of task $\tau_i$, we use a flag $r_{i,j}$ for each section $s_{i,j}$ of $\tau_i$ to indicate whether $s_{i,j}$ is a critical section or not. If $s_{i,j}$ is a non-critical section, $r_{i,j} = 0$; otherwise, $r_{i,j}$ denotes the identification of the resource that $\tau_i$ accesses during its critical section $s_{i,j}$. Therefore, we have $0 \le r_{i,j} \le R$. The subset of resources that are accessed by task $\tau_i$ during its execution is denoted as $\mathcal{R}_i$ ($\subseteq \mathcal{R}$). Note that a task may need to access a resource multiple times within its different critical sections.

Fig. 1: An example of tasks and resource access patterns.
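The task and resource model above maps naturally onto a small data structure. The sketch below is illustrative only: the class and field names (`Section`, `Task`, `wcet`, `resource`) are hypothetical, and the section lengths in the usage example are made-up values, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Section:
    wcet: float        # c_{i,j}: worst case execution time of the section
    resource: int = 0  # r_{i,j}: 0 for a non-critical section, else resource id


@dataclass
class Task:
    period: float            # p_i, also the relative deadline
    sections: List[Section]  # s_{i,1}, ..., s_{i,n_i}

    @property
    def wcet(self) -> float:
        """c_i: sum of the section WCETs."""
        return sum(s.wcet for s in self.sections)

    @property
    def utilization(self) -> float:
        """u_i = c_i / p_i."""
        return self.wcet / self.period

    @property
    def resources(self) -> Set[int]:
        """Ids of the resources accessed in any critical section of the task."""
        return {s.resource for s in self.sections if s.resource != 0}


def system_utilization(tasks: List[Task]) -> float:
    """U: sum of all task utilizations."""
    return sum(t.utilization for t in tasks)


# A task with five sections, two of them critical (on resources 1 and 2).
# The unit section lengths and the period are assumed values.
example_task = Task(
    period=10.0,
    sections=[Section(1.0), Section(1.0, 1), Section(1.0),
              Section(1.0, 2), Section(1.0)],
)
```

Storing the resource id directly in each section mirrors the flag $r_{i,j}$, so the set of accessed resources can be derived rather than stored separately.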
As an example, consider a system with five tasks $\Psi = \{\tau_1, \tau_2, \tau_3, \tau_4, \tau_5\}$ and two resources $\mathcal{R} = \{R_1, R_2\}$. The sections of tasks and their resource access patterns are shown
in Figure 1. Here, we have $n_1 = n_2 = 2$, $n_3 = 5$ and $n_4 = n_5 = 9$. Using task $\tau_3$ as the example, there are $r_{3,1} = r_{3,3} = r_{3,5} = 0$, $r_{3,2} = 1$ and $r_{3,4} = 2$. Therefore, we have $\mathcal{R}_1 = \mathcal{R}_4 = \{R_1\}$, $\mathcal{R}_2 = \{R_2\}$ and $\mathcal{R}_3 = \mathcal{R}_5 = \mathcal{R}$.

3.2 Schedulability: Partitioned-EDF and MSRP

When shared resources are considered, the execution of a task, regardless of its priority, can be blocked when it attempts to access a resource that is currently held (and accessed) by another task. Therefore, the exact sequence for tasks to access the resources and the schedulability of tasks rely on not only the scheduling algorithm but also the resource access protocol. In [10], it has been shown that, when there are only global shared resources and non-nested critical sections within tasks (as is the case considered in this work), the global and partitioned scheduling have comparable schedulability. However, considering its relatively simple per-processor based schedulability analysis [20], we focus on partitioned-EDF in this paper.

Moreover, for the resource access protocol, although the suspension-based mechanism adopted in OMLP can improve system efficiency at runtime [11], it cannot improve the off-line schedulability analysis that needs to consider the worst case scenario. Therefore, we focus on the spin-lock based MSRP resource access protocol, where the basic rules can be summarized as follows [20]:

Rule 1: When a task $\tau_i$ attempts to access a resource $R_a$, if $R_a$ is free (i.e., no other task is accessing it), it will lock and access the resource by executing its critical section non-preemptively; otherwise, if $R_a$ is currently held by another task (on a different processor), task $\tau_i$ will be added to $R_a$'s FIFO queue and the processor enters a non-preemptive busy-wait state;

Rule 2: Once a task $\tau_i$ finishes accessing a resource $R_a$ at the end of a critical section, it will release $R_a$ and become preemptable again.
If $R_a$'s FIFO queue is not empty (i.e., there are tasks from other processors waiting for accessing $R_a$), the queue's header task will start accessing the resource; otherwise, $R_a$ is unlocked.

From the above rules, we can see that the execution of a task $\tau_i$ on processor $P_k$ can be blocked due to synchronization requirements on two different occasions. First, when a low priority task $\tau_j$ (which has a deadline later than that of task $\tau_i$) is accessing or busy-waiting for a resource on the same processor $P_k$, task $\tau_i$ is blocked and the duration is denoted as local blocking time. Second, when $\tau_i$ tries to access a resource $R_a$ that is currently held and accessed by a task on another processor, it has to wait in $R_a$'s FIFO queue and the duration is denoted as global waiting time. Moreover, for tasks accessing shared resources under MSRP, we have the following properties [20]:

Property 1. For any processor at any given time, there exists at most one task that is either a) accessing a resource; or b) busy-waiting for a resource (which is currently held by a task on another processor).

Property 2. A task can be blocked by a low priority task on the same processor at most once. Moreover, a task's local blocking time is upper bounded by the longest duration of any low priority task on the same processor accessing (and waiting, if applicable) one of its resources once.

Schedulability Condition: For a given partitioning (or mapping) of tasks to processors $\Pi = \{\Psi_1, \ldots, \Psi_M\}$, where $\Psi_k$ is the subset of tasks that are allocated to processor $P_k$, we first review the schedulability condition for partitioned-EDF under MSRP [5], [20].
For the ease of presentation and discussion, some necessary notations are defined as follows:

- $BW_{i,x}$: the maximum global waiting time that task $\tau_i$ can experience when it waits for accessing resource $R_a$ in its critical section $s_{i,x}$ (where $a = r_{i,x}$) on $P_k$;
- $BW_i$: the maximum total global waiting time for task $\tau_i$ to access its resources in all its critical sections;
- $B_i$: the maximum local blocking time that can be experienced by task $\tau_i$ on processor $P_k$.

From Property 2, we know that task $\tau_i$ can only be blocked at most once by another task $\tau_j$ on the same processor when $\tau_j$ (waits and) accesses a resource in one of its critical sections, where $p_j > p_i$ [5], [20]. Therefore, we have

$$B_i = \max\{BW_{j,y} + c_{j,y} \mid \forall s_{j,y} : \tau_j \in \Psi_k \wedge p_j > p_i \wedge r_{j,y} \neq 0\} \qquad (1)$$

Taking the local blocking and global waiting times of tasks into consideration, for a given task-to-processor mapping $\Pi$, the synchronization-cognizant processor load for $P_k$ can be defined as:

$$L^{sc}_k(\Pi) = \max_{\forall \tau_i \in \Psi_k} \left\{ \frac{B_i}{p_i} + \sum_{\forall \tau_j \in \Psi_k,\ p_j \le p_i} \frac{c_j + BW_j}{p_j} \right\} \qquad (2)$$

Correspondingly, we define the synchronization-cognizant system load as the maximum of all processors' synchronization-cognizant loads as given below:

$$L^{sc}(\Pi) = \max\{L^{sc}_k(\Pi) \mid \forall P_k : 1 \le k \le M\} \qquad (3)$$

With these definitions, we can directly get the following proposition regarding the feasibility of a given task-to-processor mapping based on the results from [5], [20]:

Proposition 1. For a set of periodic real-time tasks running on a multiprocessor system with shared resources that are governed by MSRP, a given task-to-processor mapping $\Pi$ is feasible for partitioned-EDF if there is $L^{sc}(\Pi) \le 1$.

Global Waiting Time: For a given task-to-processor mapping $\Pi$, the global waiting time $BW_{i,x}$ of task $\tau_i$ (on processor $P_k$) for accessing $R_a$ in its critical section $s_{i,x}$ can be calculated as [20]:

$$BW_{i,x} = \sum_{j \neq k,\ j = 1, \ldots, M} tp^{max}_j(R_a) \qquad (4)$$

where $tp^{max}_j(R_a)$ is the maximum amount of time for any task on another processor $P_j$ ($j \neq k$) to access resource $R_a$ once. That is, in the worst case, task $\tau_i$ may have to wait for the longest access time of $R_a$ by tasks on all other processors.

Here, if $s_{i,x}$ is a non-critical section and $r_{i,x} = 0$, $BW_{i,x} = 0$. Moreover, $tp^{max}_j(R_a)$ can be further calculated as [20]:

$$tp^{max}_j(R_a) = \max\{tt^{max}_i(R_a) \mid \forall \tau_i \in \Psi_j\}$$

$$tt^{max}_i(R_a) = \max\{c_{i,y} \mid \forall s_{i,y} : r_{i,y} = a\}$$

where $tt^{max}_i(R_a)$ denotes the maximum amount of time for task $\tau_i$ to access resource $R_a$ once; if $\tau_i$ does not access $R_a$ (i.e., $R_a \notin \mathcal{R}_i$), there is $tt^{max}_i(R_a) = 0$. Then, based on $BW_{i,x}$, the total global waiting time $BW_i$ for task $\tau_i$ ($\in \Psi_k$) can be simply accumulated as [20]:

$$BW_i = \sum_{x=1}^{n_i} BW_{i,x} \qquad (5)$$

3.3 Problem Description

Based on Proposition 1, the problem to be addressed in this paper is: for a set of periodic real-time tasks running on a homogeneous multiprocessor system with shared resources under partitioned-EDF and MSRP, finding a feasible task-to-processor mapping $\Pi$ such that $L^{sc}(\Pi)$ is minimized. Note that, when the system has no shared resource, the special case of the problem becomes the traditional partitioned real-time scheduling problem. In [16], [17], it has been shown that finding the optimal task-to-processor mapping for such a problem to minimize the maximum system load is NP-hard. Therefore, the mapping problem to be studied in this paper is NP-hard as well. Hence, in what follows, we will first develop the utilization bound and then focus on efficient task mapping heuristics that explicitly take synchronization overhead of tasks accessing shared resources into consideration.

4 UTILIZATION BOUND AND ANOMALY

For real-time systems without shared resources, López et al. have investigated the utilization bounds for partitioned-EDF with various mapping heuristics, such as First-Fit (FF), Best-Fit (BF) and Worst-Fit Decreasing (WFD) [32]. As long as the system utilization of a task set is no more than such bounds, the resulting task-to-processor mappings under these heuristics are guaranteed to be feasible, which can be exploited for efficient schedulability tests. Note that such bounds increase monotonically when there are more available processors [32].
However, for systems with shared resources, from Proposition 1 and Equation (3), we can see that the feasibility of a given task-to-processor mapping depends not only on the accumulated task utilizations on every processor but also on the local blocking and global waiting times (i.e., synchronization overhead) of tasks, which can become larger when tasks are mapped to more processors. Before developing the utilization bound for systems with shared resources, we first illustrate the scheduling anomaly through a concrete example.

4.1 Scheduling Anomaly: An Example

Consider a task system with three tasks $\Psi = \{\tau_1, \tau_2, \tau_3\}$ and two resources $R = \{R_1, R_2\}$. Task $\tau_1$ accesses resource $R_1$ twice within its two critical sections, which both have size 1; moreover, $c_1 = 4$ and $p_1 = 10$. Similarly, task $\tau_2$ accesses $R_1$ once within its critical section of size 4, with $c_2 = 5$ and $p_2 = 9$; task $\tau_3$ accesses resource $R_2$ once within its critical section of size 2, with $c_3 = 8$ and $p_3 = 10$.

If only two processors are deployed, we can allocate the first two tasks to processor $P_1$ and the third task to processor $P_2$. That is, we have $\Psi_1 = \{\tau_1, \tau_2\}$ and $\Psi_2 = \{\tau_3\}$ for the mapping. From Equations (1), (4) and (5), we get $BW_1 = BW_2 = BW_3 = 0$, $B_1 = 0$, $B_2 = 1$ and $B_3 = 0$. Moreover, from Equation (2), we get $L^{sc}_1 = \max\{\frac{1}{9} + \frac{5}{9}, \frac{5}{9} + \frac{4}{10}\} \approx 0.96$ and $L^{sc}_2 = \max\{\frac{8}{10}\} = 0.8$. Thus, $L^{sc} = \max\{L^{sc}_1, L^{sc}_2\} \approx 0.96 < 1$. Therefore, the mapping is feasible under partitioned-EDF and MSRP according to Proposition 1.

However, when three processors are deployed, there is only one possible mapping of tasks to processors, where each processor is assigned one task. Here, we assume that every deployed processor is assigned at least one task (i.e., $\Psi_k \neq \emptyset$). Without loss of generality, suppose that the mapping is $\Psi_k = \{\tau_k\}$ ($k = 1, 2, 3$). We get $BW_1 = 8$, $BW_2 = 1$, $BW_3 = 0$ and $B_1 = B_2 = B_3 = 0$. Furthermore, we find that $L^{sc}_1 = \max\{\frac{4+8}{10}\} = 1.2$.
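The arithmetic of this example can be checked mechanically. The following is a minimal Python sketch of Equations (1) to (5) applied to the three tasks above; the dictionary layout and the function names (`tt_max`, `tp_max`, `bw_section`, `bw_total`, `local_blocking`, `processor_load`, `system_load`) are illustrative assumptions and do not come from the paper. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction

# Example task set of Section 4.1 (assumed encoding): WCET c, period p,
# and a list of (resource index, critical-section size) pairs.
TASKS = {
    1: {"c": 4, "p": 10, "cs": [(1, 1), (1, 1)]},
    2: {"c": 5, "p": 9, "cs": [(1, 4)]},
    3: {"c": 8, "p": 10, "cs": [(2, 2)]},
}

def tt_max(task, res):
    """Longest single access of task to resource res (0 if it never uses res)."""
    return max((size for r, size in TASKS[task]["cs"] if r == res), default=0)

def tp_max(proc_tasks, res):
    """Longest single access to res by any task on one processor."""
    return max((tt_max(t, res) for t in proc_tasks), default=0)

def bw_section(res, k, mapping):
    """Equation (4): waiting time for one access to res from processor k."""
    return sum(tp_max(tasks, res) for m, tasks in enumerate(mapping) if m != k)

def bw_total(task, k, mapping):
    """Equation (5): sum of the waiting times over all critical sections."""
    return sum(bw_section(r, k, mapping) for r, _ in TASKS[task]["cs"])

def local_blocking(task, k, mapping):
    """Equation (1): worst single blocking by a longer-period task on processor k."""
    p_i = TASKS[task]["p"]
    candidates = [bw_section(r, k, mapping) + size
                  for j in mapping[k] if TASKS[j]["p"] > p_i
                  for r, size in TASKS[j]["cs"]]
    return max(candidates, default=0)

def processor_load(k, mapping):
    """Equation (2): synchronization-cognizant load of processor k."""
    loads = []
    for i in mapping[k]:
        p_i = TASKS[i]["p"]
        demand = sum(Fraction(TASKS[j]["c"] + bw_total(j, k, mapping), TASKS[j]["p"])
                     for j in mapping[k] if TASKS[j]["p"] <= p_i)
        loads.append(Fraction(local_blocking(i, k, mapping), p_i) + demand)
    return max(loads)

def system_load(mapping):
    """Equation (3): maximum processor load over all deployed processors."""
    return max(processor_load(k, mapping) for k in range(len(mapping)))
```

Under these assumptions, `system_load([[1, 2], [3]])` evaluates to 43/45 (about 0.96) for the two-processor mapping, while `processor_load(0, [[1], [2], [3]])` evaluates to 6/5 = 1.2 for processor $P_1$ in the three-processor mapping.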
That is, the mapping fails to meet the schedulability condition for partitioned-EDF and MSRP according to Proposition 1. Here, the problem comes from the increased synchronization overhead (i.e., the global waiting time) when more processors are deployed. Hence, we have the following observation.

Observation 1. For a set of real-time tasks that access shared resources and are schedulable under partitioned-EDF and MSRP on a given number of processors, when more processors are deployed, the synchronization between the tasks can become more complex and the task set could become unschedulable due to increased synchronization overhead.

4.2 Synchronization-Cognizant Utilization Bound

Next, focusing on the WFD mapping heuristic, we develop the utilization bound for multiprocessor real-time systems with shared resources. From the previous discussion, the synchronization overhead of tasks depends on the specific task-to-processor mapping. Therefore, to find upper bounds for such synchronization overhead, and for ease of presentation, we define the following notations:

$p^{min}$: the minimum period of the tasks under consideration;
$n^{max,cs}$: the maximum number of critical sections in a task for the task system under consideration;
$c^{max,cs}$: the size of the largest critical section of all tasks in the task system under consideration;
$BW^{ub}$: the upper bound of the overall global waiting time that can be experienced by any task to access all its resources. From MSRP and Property 1, we can safely have $BW^{ub} = n^{max,cs} \cdot ((M-1) \cdot c^{max,cs})$;
$B^{ub}$: the upper bound of the local blocking time that can be experienced by any task; similarly, from Property 2, we can safely have $B^{ub} = c^{max,cs} + (M-1) \cdot c^{max,cs} = M \cdot c^{max,cs}$;
$\alpha$: the maximum synchronization-cognizant utilization of tasks, which is defined as $\alpha = \max\{\frac{c_i + BW^{ub}}{p_i} \mid \tau_i \in \Psi\}$;

$\gamma$: the upper bound for the utilization loss on any processor due to local blocking time, which is defined as $\gamma = \frac{B^{ub}}{p^{min}}$;
$\beta$: the minimum number of tasks that can feasibly fit into one processor under EDF when synchronization overhead is considered. From Proposition 1 and Equations (2) and (3), we have $\beta = \left\lfloor \frac{1-\gamma}{\alpha} \right\rfloor$.

From the definition of $\beta$, we can easily obtain the following lemma regarding the number of tasks and system schedulability, with reasoning similar to that in [32]:

Lemma 1. For a set of $N$ periodic real-time tasks running on an $M$-processor system with shared resources, the task set is schedulable under partitioned-EDF with MSRP if $N \le \beta \cdot M$.

In what follows, we focus on the cases with $N > \beta \cdot M$. Note that, with the definitions of $B^{ub}$ and $BW^{ub}$, for any given task-to-processor mapping $\Pi = \{\Psi_1, \dots, \Psi_M\}$, the following inequality holds for every processor $P_k$ ($k = 1, \dots, M$):

$$L^{sc}_k(\Pi) \le \max_{\tau_i \in \Psi_k}\left\{ \frac{B^{ub}}{p^{min}} + \sum_{\tau_j \in \Psi_k \wedge p_j \le p_i} \frac{c_j + BW^{ub}}{p_j} \right\}$$

Hence, from Proposition 1, we get the following lemma.

Lemma 2. For a set $\Psi$ of periodic real-time tasks running on an $M$-processor system with shared resources that are scheduled under partitioned-EDF with MSRP, a given task-to-processor mapping $\Pi = \{\Psi_1, \dots, \Psi_M\}$ is feasible if, for every processor $P_k$ ($k = 1, \dots, M$):

$$\max_{\tau_i \in \Psi_k}\left\{ \frac{B^{ub}}{p^{min}} + \sum_{\tau_j \in \Psi_k \wedge p_j \le p_i} \frac{c_j + BW^{ub}}{p_j} \right\} \le 1 \tag{6}$$

Note that, compared to Equation (2), Equation (6) represents a much more relaxed sufficient condition for the schedulability of tasks. In what follows, we derive the synchronization-cognizant utilization bound ($U^{sc,bound}$) for systems with shared resources based on Equation (6). That is, as long as the system utilization of a task set is no more than $U^{sc,bound}$, the resulting task-to-processor mapping under WFD guarantees that Equation (6) holds, which further implies that the task set is schedulable under partitioned-EDF and MSRP. Define $\sigma = \max\{\gamma, \frac{BW^{ub}}{p^{min}}\}$ as the synchronization overhead factor.

Theorem 1.
For a set of $N$ periodic real-time tasks running on an $M$-processor system with shared resources where the number of tasks $N > \beta \cdot M$, the synchronization-cognizant utilization bound ($U^{sc,bound}$) for partitioned-EDF and MSRP under the WFD mapping heuristic can be found as:

$$U^{sc,bound} = \min\{U^{b1}, U^{b2}\} \tag{7}$$

where

$$U^{b1} = \frac{\beta M + 1}{1 + \beta}(1 - \sigma) - (\beta M + 1)\,\sigma \tag{8}$$

$$U^{b2} = \frac{M N}{M + N - 1}(1 - \sigma) - N \sigma \tag{9}$$

Proof: With the focus on WFD, we assume that tasks are sorted in non-increasing order of their utilizations. That is, for tasks $\tau_i$ and $\tau_j$ where $1 \le i < j \le N$, we have $u_i \ge u_j$. Suppose that task $\tau_n$ is the first task that would fail the sufficient condition represented by Equation (6) on every processor, should it be allocated to that processor. That is, for every processor $P_k$ ($k = 1, \dots, M$):

$$\frac{c_n + BW^{ub}}{p_n} + \frac{B^{ub}}{p^{min}} + \sum_{\tau_j \in \Psi_k} \frac{c_j + BW^{ub}}{p_j} > 1$$

where $\Psi_k$ contains the subset of tasks on processor $P_k$ after allocating the first $(n-1)$ tasks. Note that $p^{min} \le p_i$ ($i = 1, \dots, n$). From the definition of $\sigma$, the above inequality can be transformed as follows on every processor $P_k$:

$$u_n + 2\sigma + \sum_{\tau_j \in \Psi_k} u_j + |\Psi_k| \cdot \sigma > 1$$

where $|\Psi_k|$ represents the number of tasks in the subset $\Psi_k$. Adding all these $M$ inequalities together, we get:

$$(M-1) \cdot u_n + \sum_{j=1}^{n} u_j + (2M + n - 1) \cdot \sigma > M$$

From the assumption that tasks are ordered in non-increasing order of their utilizations, we have $u_n \le \frac{\sum_{j=1}^{n} u_j}{n}$. Thus, the above inequality can be further transformed as:

$$\left(\frac{M-1}{n} + 1\right) \sum_{j=1}^{n} u_j + (2M + n - 1) \cdot \sigma > M$$

Hence, we have:

$$\sum_{j=1}^{n} u_j > \frac{n}{M + n - 1}\,\bigl(M - (2M + n - 1) \cdot \sigma\bigr)$$

Therefore, considering that the system utilization of the whole task set satisfies $U \ge \sum_{j=1}^{n} u_j$, we further have:

$$U > \frac{n\,\bigl(M - (2M + n - 1) \cdot \sigma\bigr)}{M + n - 1} = \frac{M n}{M + n - 1}(1 - \sigma) - n \sigma = f(n) \tag{10}$$

where $f(n)$ is a function of $n$. Note that $\beta M < n \le N$. To obtain a positive (and meaningful) utilization bound, we need $f(n) > 0$. That is,

$$\frac{1 - \sigma}{\sigma} > \frac{M + n - 1}{M} \ge \frac{M + (\beta M + 1) - 1}{M} = 1 + \beta$$

Therefore, we need $\sigma < \frac{1}{\beta + 2} < 1$.
Since $\sigma$ does not depend on $n$, the second derivative of $f(n)$ with respect to $n$ is

$$f''(n) = -\frac{2M\,(M-1)\,(1-\sigma)}{(M + n - 1)^3} < 0$$

Therefore, $f(n)$ is a concave function, and its minimum value over $\beta M < n \le N$ is attained at either $n = \beta M + 1$ or $n = N$. Note that

$$f(\beta M + 1) = \frac{\beta M + 1}{1 + \beta}(1 - \sigma) - (\beta M + 1)\,\sigma = U^{b1}$$

$$f(N) = \frac{M N}{M + N - 1}(1 - \sigma) - N \sigma = U^{b2}$$

Hence, the utilization bound $U^{sc,bound}$ is the minimum value of $f(n)$, that is, $U^{sc,bound} = \min\{U^{b1}, U^{b2}\}$. Therefore, if the system utilization of a task set is no more than $U^{sc,bound}$, WFD is guaranteed to generate a task-to-processor mapping satisfying Equation (6), which concludes the proof.

As mentioned earlier, because it exploits the upper bounds of the synchronization overhead, the simplified schedulability condition given in Equation (6) is very loose. Therefore, the synchronization-cognizant utilization bound derived from this sufficient condition, as shown in Equation (7), is rather pessimistic. Specifically, for systems with large synchronization overhead (i.e., large $B^{ub}$ and $BW^{ub}$, and thus large $\sigma$), the value of $U^{sc,bound}$ can be very small, which limits its applicability.

On the other hand, when no task needs to access any resource (or there is no shared resource in the system), we have $\sigma = 0$. For the function $f(n)$ defined in Equation (10), its first derivative is then $f'(n) = \frac{M\,(M-1)}{(M + n - 1)^2} > 0$. Therefore, the minimum value of $f(n)$ is $\frac{\beta M + 1}{1 + \beta}$, attained when $n = \beta M + 1$, which reduces to the utilization bound for systems without shared resources under WFD [32].

4.3 Non-Monotonicity of the Bound $U^{sc,bound}$

Different from the utilization bounds for systems without shared resources (which increase monotonically when the number of available processors increases [32]), the utilization bound $U^{sc,bound}$ for systems with shared resources can become lower when more processors are deployed. From Equations (7) to (9), we can see that, in addition to the number of deployed processors, $U^{sc,bound}$ depends heavily on the synchronization overhead of tasks. However, for a given set of real-time tasks, the bounds for such overhead (i.e., $B^{ub}$ and $BW^{ub}$) become larger as the number of deployed processors increases. In what follows, we formally analyze this non-monotonicity of the utilization bound $U^{sc,bound}$.
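To make the dependence of the bound on $M$ concrete, the quantities of Section 4.2 can be evaluated directly. The following is a minimal Python sketch, assuming integer task parameters given as `(c_i, p_i)` pairs; the function name `sc_bound` and its return convention are hypothetical and only serve this illustration. It computes $\beta$, $\sigma$ and $\min\{U^{b1}, U^{b2}\}$ from Equations (7) to (9), and returns no bound when $N \le \beta M$ (the case covered by Lemma 1) or when $\beta$ would be negative.

```python
from fractions import Fraction
from math import floor

def sc_bound(tasks, M, n_max_cs, c_max_cs):
    """Return (beta, sigma, bound) following Equations (7)-(9).

    tasks: list of (c_i, p_i) integer pairs (assumed input format).
    bound is None when Theorem 1 does not apply (N <= beta * M, i.e. the
    Lemma 1 case) or when beta is negative (no meaningful bound).
    """
    N = len(tasks)
    p_min = min(p for _, p in tasks)
    bw_ub = n_max_cs * (M - 1) * c_max_cs        # BW^ub
    b_ub = M * c_max_cs                          # B^ub
    alpha = max(Fraction(c + bw_ub, p) for c, p in tasks)
    gamma = Fraction(b_ub, p_min)
    sigma = max(gamma, Fraction(bw_ub, p_min))   # synchronization overhead factor
    beta = floor((1 - gamma) / alpha)
    if beta < 0 or N <= beta * M:
        return beta, sigma, None
    n1 = beta * M + 1
    u_b1 = Fraction(n1, 1 + beta) * (1 - sigma) - n1 * sigma          # Equation (8)
    u_b2 = Fraction(M * N, M + N - 1) * (1 - sigma) - N * sigma       # Equation (9)
    return beta, sigma, min(u_b1, u_b2)                               # Equation (7)
```

Calling `sc_bound` for the same task set with increasing values of `M` is one way to observe how the growth of $B^{ub}$ and $BW^{ub}$ feeds into $\sigma$ and $\beta$; with `c_max_cs = 0` the sketch falls back to the bound $\frac{\beta M + 1}{1 + \beta}$ for systems without shared resources.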
Note that, from Equation (8), $U^{sc,bound}$ also depends on $\beta$, the minimum number of tasks that can feasibly fit into one processor under EDF when synchronization overhead is considered. Recall that $\beta = \left\lfloor \frac{1-\gamma}{\alpha} \right\rfloor$, where $\gamma = \frac{c^{max,cs} \cdot M}{p^{min}}$ and $\alpha = \max\left\{\frac{c_i + n^{max,cs} \cdot c^{max,cs} \cdot (M-1)}{p_i} \mid \tau_i \in \Psi\right\}$. That is, $\beta$ relies on $M$ as well. However, due to the nature of the floor operation, we have the following lemma regarding the invariance of $\beta$ when $M$ changes.

Lemma 3. For a given set $\Psi$ of real-time tasks that access shared resources, $\beta$ is a constant $I$ (an integer) when $L(I) < M \le H(I)$, where

$$L(I) = \frac{p_m + (n^{max,cs} \cdot c^{max,cs} - c_m) \cdot (I + 1)}{n^{max,cs} \cdot c^{max,cs} \cdot (I + 1) + p_m \cdot c^{max,cs} / p^{min}} \tag{11}$$

$$H(I) = \frac{p_m + (n^{max,cs} \cdot c^{max,cs} - c_m) \cdot I}{n^{max,cs} \cdot c^{max,cs} \cdot I + p_m \cdot c^{max,cs} / p^{min}} \tag{12}$$

Here, task $\tau_m$'s maximum synchronization-cognizant utilization is assumed to be $\frac{c_m + n^{max,cs} \cdot c^{max,cs} \cdot (M-1)}{p_m} = \alpha$.

Proof: Note that, for a given integer $I$, from the definition of $\beta$, we have $\beta = I$ when $I \le \frac{1-\gamma}{\alpha} < (I + 1)$. Substituting $\alpha$ and $\gamma$ into these inequalities, we get

$$I \le \frac{1 - c^{max,cs} \cdot M / p^{min}}{\bigl(c_m + n^{max,cs} \cdot c^{max,cs} \cdot (M-1)\bigr) / p_m} < I + 1$$

Solving the two inequalities for $M$, we obtain that $\beta = I$ when $L(I) < M \le H(I)$, which concludes the proof.

Therefore, from the above lemma, we can consider $\beta$ to be a constant $I$ when $M$ changes within the range $(L(I), H(I)]$, which further leads to the following theorem regarding the non-monotonicity of the utilization bound $U^{sc,bound}$.

Theorem 2. For a given set $\Psi$ of real-time tasks that access shared resources, the synchronization-cognizant utilization bound under partitioned-EDF and MSRP as represented in Equation (7) can decrease as the number of deployed processors increases when $\frac{1}{3+\beta} < \sigma < \frac{1}{2+\beta}$.

Proof: Define $\varsigma = \frac{n^{max,cs} \cdot c^{max,cs}}{p^{min}}$. Suppose that $BW^{ub} \ge B^{ub}$. We then have $\sigma = \frac{BW^{ub}}{p^{min}} = \frac{(M-1) \cdot n^{max,cs} \cdot c^{max,cs}}{p^{min}} = (M-1) \cdot \varsigma$. We further define $g(M) = U^{b1} = \frac{\beta M + 1}{1 + \beta}(1 - \sigma) - (\beta M + 1) \cdot \sigma$ and $h(M) = U^{b2} = \frac{M N}{M + N - 1}(1 - \sigma) - N \sigma$.
For $g(M)$, its first derivative with respect to $M$ is (note that $\beta$ can be considered a constant according to Lemma 3):

$$g'(M) = \beta \cdot \frac{1 - \sigma}{1 + \beta} - \varsigma \cdot \frac{\beta M + 1}{1 + \beta} - (\beta M + 1) \cdot \varsigma - \beta \sigma$$

Substituting $\varsigma = \frac{\sigma}{M-1}$ into the above equation, we get

$$g'(M) = \beta \cdot \frac{1 - \sigma}{1 + \beta} - \beta \sigma - \frac{\beta M + 1}{M - 1}\left(\frac{\sigma}{1 + \beta} + \sigma\right)$$

Since $\frac{\beta M + 1}{M - 1} > \beta$, the above equation can be transformed as

$$g'(M) < \frac{\beta \cdot (1 - 4\sigma - 2\beta\sigma)}{1 + \beta}$$

Here, we know that $g'(M) < 0$ when $\frac{1}{2(2+\beta)} < \sigma$. Moreover, from the proof of Theorem 1, we have $\sigma < \frac{1}{2+\beta}$. Therefore, when $\frac{1}{2(2+\beta)} < \sigma < \frac{1}{2+\beta}$, $g(M)$ decreases as $M$ increases.

For $h(M)$, its first derivative with respect to $M$ is:

$$h'(M) = N\left(\frac{(N-1)(1-\sigma)}{(M + N - 1)^2} - \frac{2M + N - 1}{M + N - 1} \cdot \varsigma\right)$$

Substituting $\varsigma = \frac{\sigma}{M-1}$ into the above equation, we have

$$h'(M) = N\left(\frac{(N-1)(1-\sigma)}{(M + N - 1)^2} - \frac{2M + N - 1}{M + N - 1} \cdot \frac{\sigma}{M - 1}\right) < \frac{N}{M + N - 1}\left((1 - \sigma) - 2\sigma - \frac{(N-1)\,\sigma}{M - 1}\right)$$

Since $N \ge \beta M + 1$, the above inequality can be further transformed as

$$h'(M) < \frac{N}{M + N - 1}\,(1 - 3\sigma - \beta\sigma)$$

Therefore, we have $h'(M) < 0$ when $\frac{1}{3+\beta} < \sigma$. Note that $\frac{1}{2(2+\beta)} < \frac{1}{3+\beta}$. Hence, when $\frac{1}{3+\beta} < \sigma < \frac{1}{2+\beta}$, it is possible for both $g(M)$ and $h(M)$ (and thus the utilization bound $U^{sc,bound}$) to decrease as the number of deployed processors $M$ increases.

For cases where $BW^{ub} < B^{ub}$, by re-defining $\varsigma = \frac{c^{max,cs}}{p^{min}}$, we have $\sigma = M \cdot \varsigma$. Following similar steps, we also get that both $g(M)$ and $h(M)$ decrease as the number of deployed processors increases when $\frac{1}{3+\beta} < \sigma < \frac{1}{2+\beta}$, which concludes the proof.

5 TIGHTENED SYNCHRONIZATION OVERHEAD

Based on the upper bounds of tasks' synchronization overhead, we developed the utilization bound $U^{sc,bound}$, which can be exploited as an efficient schedulability test for the traditional WFD that relies on tasks' original utilizations. However, the pessimistic nature of $U^{sc,bound}$ greatly limits its applicability. Moreover, without taking the synchronization overhead of tasks into consideration, the traditional WFD is very likely to fail to obtain feasible task-to-processor mappings for task sets with higher system utilizations [24].

On the other hand, the existing approach to calculating synchronization overhead is still loose, especially for the total global waiting time in Equation (5). Here, each time a task $\tau_i$ accesses a resource in one of its critical sections, the calculation assumes the worst-case interference from tasks on other processors [20]. Such a simplified calculation can result in unnecessarily large values for tasks' total global waiting time, which in turn can falsely reject a feasible task-to-processor mapping based on Proposition 1.

5.1 Limits on Interference among Tasks

Recall that we consider synchronous periodic tasks, where the first job of each task arrives at time 0. Therefore, from [11], we can directly obtain the following proposition regarding the interference among tasks under partitioned-EDF scheduling.

Proposition 2.
For any two tasks $\tau_i$ and $\tau_j$ in a synchronous periodic task set scheduled under partitioned-EDF, the maximum number of $\tau_j$'s jobs that can interfere with the execution of any job of $\tau_i$ due to accessing shared resources is:

$$\theta_{i,j} = \begin{cases} 1, & p_i < p_j \wedge mod(p_j, p_i) = 0 \\ \frac{p_i}{p_j}, & p_i \ge p_j \wedge mod(p_i, p_j) = 0 \\ \left\lceil \frac{p_i}{p_j} \right\rceil + 1, & \text{otherwise} \end{cases} \tag{13}$$

where $mod(x, y)$ returns the remainder of dividing $x$ by $y$.

Note that a task may access one resource multiple times in its different critical sections. Define the set of task $\tau_i$'s critical sections in which $\tau_i$ accesses resource $R_a$ as $S_{i,a} = \{s_{i,j} \mid r_{i,j} = a;\ j = 1, \dots, n_i\}$. Then, from Proposition 2 and the MSRP protocol [20], we obtain the proposition below.

Proposition 3. For a given task-to-processor mapping $\Pi$, the number of interferences (i.e., global waiting instances) that can be experienced by any job of task $\tau_i$ ($\in \Psi_k$) because of accessing resource $R_a$ is subject to the following limits:
$\theta_{i,j}$: caused by any task $\tau_j$ where $\tau_j \notin \Psi_k$;
$|S_{i,a}|$: caused by tasks on any processor $P_m$ ($m \neq k$);
where $|S_{i,a}|$ denotes the number of critical sections in $S_{i,a}$.

Algorithm 1: Calculate waiting time $BW_i(R_a)$
Input: $\Psi$, $\Pi$ ($= \{\Psi_k\}$), $\tau_i \in \Psi_k$ and $R_a$ ($\in R_i$);
Output: $BW_i(R_a)$;
1: $BW_i(R_a) = 0$; $limit[m] = |S_{i,a}|$ ($m = 1, \dots, M$);
2: for ($s_{j,y} \in S_a(\Psi)$ where $\tau_j \in \Psi_m \wedge m \neq k$) do
3: $count = \min\{limit[m], \theta_{i,j}\}$;
4: $BW_i(R_a)$ += $count \cdot c_{j,y}$; $limit[m]$ -= $count$;
5: end for

5.2 Resource-Oriented Global Waiting Time

Next, we study a resource-oriented approach to tightening the calculation of tasks' total global waiting time by exploiting the interference limits among tasks. Such tightened overhead can in turn improve the acceptance test of given task-to-processor mappings. For this purpose, we further define $S_a(\Psi)$ as the set of critical sections of all tasks in $\Psi$ in which resource $R_a$ is accessed; that is, $S_a(\Psi) = \bigcup_{\tau_i \in \Psi} S_{i,a}$. Here, we assume that the critical sections in $S_a(\Psi)$ are in descending order of their sizes.
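The interference limit of Proposition 2 and the steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the task encoding (a dictionary with a period `p` and a list `cs` of `(resource, size)` pairs) and the function names `theta` and `global_waiting` are assumptions, and the third case of Equation (13) is read here as a ceiling plus one.

```python
def theta(p_i, p_j):
    """Equation (13): max number of tau_j's jobs that can interfere with one job of tau_i."""
    if p_i < p_j and p_j % p_i == 0:
        return 1
    if p_i >= p_j and p_i % p_j == 0:
        return p_i // p_j
    return -(-p_i // p_j) + 1  # integer ceiling of p_i / p_j, plus one (assumed reading)

def global_waiting(i, res, k, mapping, tasks):
    """Algorithm 1: waiting time BW_i(R_a) of task i (on processor k) for resource res.

    mapping: list of per-processor task-id lists; tasks: id -> {"p": ..., "cs": [...]}.
    """
    n_access = sum(1 for r, _ in tasks[i]["cs"] if r == res)   # |S_{i,a}|
    limit = [n_access] * len(mapping)                          # line 1: per-processor limits
    # Critical sections on res owned by tasks on processors other than k (line 2).
    sections = [(size, j, m)
                for m, proc in enumerate(mapping) if m != k
                for j in proc
                for r, size in tasks[j]["cs"] if r == res]
    sections.sort(key=lambda s: s[0], reverse=True)            # descending order of size
    bw = 0
    for size, j, m in sections:
        count = min(limit[m], theta(tasks[i]["p"], tasks[j]["p"]))  # line 3
        bw += count * size                                          # line 4
        limit[m] -= count
    return bw
```

As a small usage example under these assumptions, consider a task with period 5 that accesses a resource three times and a task with period 10 on another processor that holds the same resource for 4 time units. Equation (5) would charge 3 × 4 = 12 units of waiting, whereas `global_waiting` charges 4, because `theta(5, 10)` limits the interference to one job.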
For a given task-to-processor mapping $\Pi$ where task $\tau_i \in \Psi_k$, Algorithm 1 summarizes the steps to calculate the maximum total global waiting time $BW_i(R_a)$ that can be experienced by any job of task $\tau_i$ due to accesses of resource $R_a$ ($\in R_i$). Here, based on Proposition 3, the limits on the number of interferences from each processor are first initialized (line 1). Then, in descending order of their sizes, the critical sections in $S_a(\Psi)$ are processed one at a time. Note that a critical section can interfere with task $\tau_i$ only if it belongs to a task on a processor other than $P_k$ (line 2). Next, the number of interferences that a critical section $s_{j,y}$ can impose on task $\tau_i$ is subject to the limit of its task $\tau_j$ as well as the remaining limit of $\tau_j$'s processor (line 3). Finally, the global waiting time $BW_i(R_a)$ is accumulated and the processor's limit is updated accordingly (line 4).

Considering all resources that are accessed by task $\tau_i$, its maximum total global waiting time is given as:

$$BW_i = \sum_{R_a \in R_i} BW_i(R_a) \tag{14}$$

From Algorithm 1, we can see that, by incorporating the interference limits, the longest critical section (with size $tp^{max}_m(R_a)$) of tasks accessing $R_a$ on a processor $P_m$ ($m \neq k$) may not be able to interfere with task $\tau_i$ every time it accesses $R_a$. Therefore, the maximum total global waiting time $BW_i$ of task $\tau_i$ obtained from this resource-oriented approach and given by Equation (14) is no more than that of Equation (5).

Note that the local blocking time given by Equation (1) relies on individual critical sections and cannot be further reduced. However, with the tightened global waiting time, the resource-oriented approach can obtain smaller values for the synchronization-cognizant processor load given in Equation (2), which can improve schedulability as shown in Section 7.

6 SYNCHRONIZATION-COGNIZANT MAPPING

For partitioned scheduling, there are two main issues when allocating tasks to processors: a) the order (i.e., priority) of tasks being allocated; and b) the selection of an appropriate target processor for the next task to be allocated. With different objectives (e.g., to minimize the number of processors deployed or to balance the workload among processors), various heuristics (such as BFD and WFD [32]) have been studied for ordering the tasks and selecting the target processors.

In [24], we studied a synchronization-aware WFD (SA-WFD) task mapping scheme that considers synchronization overhead when ordering and allocating tasks to processors. However, the order (i.e., priority) of tasks relies on a fixed estimation of their maximum synchronization overhead, and the selection of the target processor is based on a simple metric of resource similarity (defined as the number of common resources accessed by tasks). Therefore, SA-WFD can still fail to generate feasible mappings for many task sets [24].

In this work, based on the tightened resource-oriented global waiting time of tasks, we propose the synchronization-cognizant task mapping algorithms (SC-TMA). The objective is to obtain feasible and load-balanced task-to-processor mappings for more task sets with shared resources and thus improve schedulability. Specifically, SC-TMA prioritizes and allocates tasks based on iteratively updated synchronization overhead of tasks, which accounts for the constantly changing limits on interference among tasks during the mapping process.

Algorithm 2: Outline of SC-TMA
Input: $\Psi$ (the task set) and $M$ (number of processors);
Output: A feasible mapping $\Pi$ or FAIL;
1: $\Pi = \emptyset$; $L^{sc,min} = \infty$;
2: for ($K : \lceil U \rceil \to M$) do
3: $\Phi = \Psi$; $\Psi_k = \emptyset$ ($k = 1, \dots, K$);
4: while ($\Phi \neq \emptyset$) do
5: $\tau_i$ = MaxPriority($\Phi$, $\{\Psi_1, \dots, \Psi_K\}$); //Section 6.2
6: $k$ = FindProcessor($\tau_i$, $\{\Psi_1, \dots, \Psi_K\}$); //Section 6.3
7: $\Psi_k = \Psi_k \cup \{\tau_i\}$; $\Phi = \Phi - \{\tau_i\}$;
8: end while
9: $\Pi^{tmp} = \{\Psi_1, \dots, \Psi_K\}$;
10: Calculate $L^{sc}(\Pi^{tmp})$; //from Equations (1-4) and (14)
11: if ($L^{sc}(\Pi^{tmp}) \le 1$ and $L^{sc}(\Pi^{tmp}) < L^{sc,min}$) then
12: $\Pi = \Pi^{tmp}$; $L^{sc,min} = L^{sc}(\Pi^{tmp})$;
13: end if
14: end for
15: Return ($\Pi \neq \emptyset$ ? $\Pi$ : FAIL);
6.1 Overview of SC-TMA

From Section 4, we know that a task set with shared resources may be schedulable on fewer processors but not on more processors. That is, not all of the $M$ available processors may be utilized in a feasible task-to-processor mapping. Moreover, for a task set $\Psi$ with system utilization $U$, the minimum number of processors needed to successfully schedule the tasks is $\lceil U \rceil$. Therefore, in order to find the best mapping with the minimum system load and an appropriate number of processors, Algorithm 2 gives the outline of SC-TMA.

Here, we search over all possible numbers of processors (line 2). For each case, we iteratively select the highest priority task one at a time, where tasks' priorities are updated based on the current partial mapping information (line 5), and its corresponding processor (line 6). Once all tasks are allocated, we calculate the synchronization-cognizant system load of the current mapping based on Equations (1-4) and (14). If the current mapping is feasible and better than the best mapping obtained so far (line 11), the best mapping is updated (line 12). In the end, SC-TMA either fails to find a feasible mapping or returns the best feasible one (line 15).

Algorithm 3: CalBWMax($\tau_i$, $R_a$, $\Phi$, $\{\Psi_1, \dots, \Psi_K\}$)
Input: $\tau_i \in \Phi$, $R_a$ ($\in R_i$) and $\{\Psi_1, \dots, \Psi_K\}$;
Output: $BW^{max}_i(R_a)$;
1: $BW^{max}_i(R_a) = 0$; $limit[k] = |S_{i,a}|$ ($k = 1, \dots, K$);
2: $limit_{total} = (K - 1) \cdot |S_{i,a}|$;
3: for ($s_{j,y} \in S_a(\Psi)$) do
4: if ($\tau_j \in \Psi_k$) then
5: $count = \min\{limit_{total}, \theta_{i,j}, limit[k]\}$;
6: $limit[k]$ -= $count$;
7: else
8: $count = \min\{limit_{total}, \theta_{i,j}, |S_{i,a}|\}$;
9: end if
10: $BW^{max}_i(R_a)$ += $count \cdot c_{j,y}$; $limit_{total}$ -= $count$;
11: end for

6.2 Prioritization of Unmapped Tasks

From Equation (2), we can see that the load of a processor depends heavily on tasks' synchronization overhead, especially the total global waiting time of each task.
Following the same idea as SA-WFD [24], SC-TMA also prioritizes unmapped tasks based on their estimated synchronization-cognizant utilizations, which are defined as:

$$u^{sc,estimate}_i = \frac{c_i + BW^{max}_i}{p_i} \tag{15}$$

where $BW^{max}_i$ denotes the estimated maximum total global waiting time of task $\tau_i$. However, different from SA-WFD, which utilizes the fixed value of $BW^{max}_i$ obtained from Equation (5) (i.e., fixed priorities of tasks) [24], SC-TMA relies on the more accurate resource-oriented Equation (14) to calculate $BW^{max}_i$ and adjusts it constantly based on the current partial mapping information. Such considerations, as shown in Section 7, can significantly improve the resulting task-to-processor mappings.

The detailed steps to estimate the maximum resource-oriented global waiting time $BW^{max}_i(R_a)$ based on the current partial mapping information are given in Algorithm 3. Here, similar to Algorithm 1, the limits on the interference among tasks are also incorporated. Note that not all tasks have been mapped to processors in the partial mapping. Therefore, when $K$ ($\le M$) processors are considered, there is an additional limit on the total number of interferences that can be experienced by a job of task $\tau_i$ due to accesses of resource $R_a$ (line 2). Moreover, the interference limits from mapped and unmapped tasks are differentiated (lines 4 to 9). Once $u^{sc,estimate}_i$ has been obtained for every unmapped task $\tau_i$ according to Equations (14) and (15), the highest priority task to be mapped next is the one with the largest $u^{sc,estimate}_i$.
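The prioritization step can be sketched in Python along the lines of Algorithm 3 and Equation (15). This is only an illustration under stated assumptions: the task encoding and the names `cal_bw_max`, `estimated_utilization` and `max_priority` are hypothetical, `theta` repeats Equation (13) with the third case read as a ceiling plus one, and the sketch skips $\tau_i$'s own critical sections when scanning $S_a(\Psi)$, which the algorithm listing does not state explicitly.

```python
from fractions import Fraction

def theta(p_i, p_j):
    """Equation (13), with the 'otherwise' case read as ceil(p_i / p_j) + 1 (assumption)."""
    if p_i < p_j and p_j % p_i == 0:
        return 1
    if p_i >= p_j and p_i % p_j == 0:
        return p_i // p_j
    return -(-p_i // p_j) + 1

def cal_bw_max(i, res, partial, tasks):
    """Algorithm 3: estimated BW_i^max(R_a) for unmapped task i.

    partial: list of K task-id lists (current partial mapping); tasks that
    appear in no list are treated as unmapped.
    """
    K = len(partial)
    n_access = sum(1 for r, _ in tasks[i]["cs"] if r == res)   # |S_{i,a}|
    limit = [n_access] * K                                     # line 1
    limit_total = (K - 1) * n_access                           # line 2
    proc_of = {j: m for m, proc in enumerate(partial) for j in proc}
    # Critical sections on res of all other tasks, largest first (own sections skipped: assumption).
    sections = sorted(((size, j) for j, t in tasks.items() if j != i
                       for r, size in t["cs"] if r == res), reverse=True)
    bw = 0
    for size, j in sections:
        th = theta(tasks[i]["p"], tasks[j]["p"])
        if j in proc_of:                                       # mapped task (lines 4-6)
            m = proc_of[j]
            count = min(limit_total, th, limit[m])
            limit[m] -= count
        else:                                                  # unmapped task (line 8)
            count = min(limit_total, th, n_access)
        bw += count * size                                     # line 10
        limit_total -= count
    return bw

def estimated_utilization(i, partial, tasks):
    """Equation (15), summing BW_i^max(R_a) over the resources used by task i."""
    resources = {r for r, _ in tasks[i]["cs"]}
    bw = sum(cal_bw_max(i, r, partial, tasks) for r in resources)
    return Fraction(tasks[i]["c"] + bw, tasks[i]["p"])

def max_priority(unmapped, partial, tasks):
    """Pick the unmapped task with the largest estimated utilization."""
    return max(unmapped, key=lambda i: estimated_utilization(i, partial, tasks))
```

Because the estimate is recomputed from the current partial mapping on every call, the relative order of the remaining tasks can change as tasks are placed, which is the behavior that distinguishes this prioritization from a fixed ordering.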

6.3 Selection of A Target Processor

To select an appropriate target processor for the highest priority task $\tau_i$ to be mapped next, we adopt the worst-fit principle, as it normally results in a workload-balanced mapping [32]. That is, the task $\tau_i$ should be mapped to a processor $P_k$ such that $L^{sc}(\{\Psi_1, \dots, \Psi_k \cup \{\tau_i\}, \dots, \Psi_K\})$ ($k = 1, \dots, K$) is minimized. When more than one processor leads to the same minimum synchronization-cognizant system load, to provide more space for future tasks and increase the chance of obtaining a feasible mapping, the one that minimizes the lowest processor load should be selected. If there is still a tie, any of the processors can be chosen.

6.3.1 Probe-Based Processor Selection

From Section 3.2, the allocation of task $\tau_i$ to a processor $P_k$ can affect not only the tasks on $P_k$, due to changes in local blocking time, but also tasks on other processors, through possibly increased global waiting time. Therefore, for the case of allocating $\tau_i$ to $P_k$, the loads on all processors (and thus the system load) need to be updated according to Equations (1-4) and (14) as well as Algorithm 1. Note that there are $K$ possible allocations for task $\tau_i$. Based on the above-mentioned criteria (i.e., minimize the system load and then the lowest processor load), the most appropriate processor $P_k$ can thus be selected. This scheme is called SC-TMA with probe-based processor selection. Because it re-calculates the loads on all processors for every possible allocation of $\tau_i$, the probe-based scheme has rather high complexity.

6.3.2 Quick Processor Selection

As a more efficient scheme with reduced complexity, we introduce another SC-TMA with quick processor selection, which is denoted as SC-TMA-Quick. Essentially, it focuses on only one processor at a time. That is, for any processor $P_k$, there are only two possibilities when allocating task $\tau_i$: either $\tau_i$ is mapped to $P_k$ or it is not. Then, the new processor load on $P_k$ can be estimated for both possibilities as follows.
When $\tau_i$ is assumed to be mapped to $P_k$, its local blocking time and global waiting time can be updated accordingly. For the existing tasks in $\Psi_k$, their global waiting times remain unchanged. Moreover, for the local blocking time, adjustments are only needed for those tasks in $\Psi_k$ that have periods less than that of $\tau_i$. Therefore, for the case of $\tau_i$ being mapped to $P_k$, based on Equation (2), $P_k$'s new processor load $L^{sc}_k(\{\dots, \Psi_k \cup \{\tau_i\}, \dots\})$ can be calculated.

When $\tau_i$ is assumed to be mapped to a processor other than $P_k$, for the existing tasks in $\Psi_k$, their global waiting times can be estimated by assuming that $\tau_i$ always brings in the maximum number of interferences on them (similar to line 3 in Algorithm 3), since it is not known exactly which processor $\tau_i$ will be on. Then, their local blocking times can be adjusted accordingly. At the end, $P_k$'s new processor load $L^{sc}_k(\{\dots, \Psi_k, \dots\})$ can be re-calculated.

Once we get the two arrays of new loads for all $K$ processors, two candidates $P_x$ and $P_y$ are identified. Here, $L^{sc}_x(\{\dots, \Psi_x \cup \{\tau_i\}, \dots\})$ has the minimum value and $L^{sc}_y(\{\dots, \Psi_y, \dots\})$ has the maximum value within their respective arrays. Following the above-mentioned principles, the task $\tau_i$ will be allocated to processor $P_y$ only if the following two conditions are satisfied:

$$L^{sc}_x(\{\dots, \Psi_x, \dots\}) < L^{sc}_x(\{\dots, \Psi_x \cup \{\tau_i\}, \dots\})$$

$$\max\{L^{sc}_k(\{\dots, \Psi_k \cup \{\tau_i\}, \dots\})\} \le L^{sc}_y(\{\dots, \Psi_y, \dots\})$$

Otherwise, $\tau_i$ will be allocated to processor $P_x$.

7 EVALUATIONS AND DISCUSSIONS

In this section, we evaluate the performance of the proposed synchronization-cognizant task mapping algorithm (SC-TMA) through extensive simulations. For comparison, we implemented the conventional WFD (which allocates tasks solely based on their utilizations without considering synchronization overhead [32]), our previous synchronization-aware WFD (SA-WFD) [24] and the macrotask-based mapping scheme of [34]. Here, as the scheme of [34] adopts a different synchronization protocol, the calculation of processor loads conforms to the principles presented in [25].
We compare these task mapping schemes based on the following performance metrics: a) Schedulability Ratio, which is defined as the ratio of the number of schedulable task sets over the total number of task sets considered; b) System Load ($L^{sc}(\Pi)$) as defined in Equation (3), which incorporates synchronization overhead and essentially indicates the quality of the resulting task-to-processor mappings in terms of how closely they meet the sufficient schedulability condition stated in Proposition 1; and c) Average Processor Load $\frac{\sum_{k=1}^{K} L^{sc}_k(\Pi)}{K}$, where $K$ refers to the number of deployed processors in the resulting mapping and $L^{sc}_k(\Pi)$ is the processor load as defined in Equation (2); it indicates how well the synchronization-cognizant workload is distributed among the processors. Here, the system load and average processor load only account for the schedulable task sets.

7.1 Simulation Settings

There are many factors in the system that can affect the performance of the mapping algorithms under consideration. In this work, we vary the following parameters, as summarized in Table 1: the number of available processors ($M$) and the number of shared resources in the system ($R$); and the normalized system raw utilization without considering synchronization cost, $NSRU = \frac{U}{M}$, where $U$ is the system utilization as defined in Section 3. For tasks, we consider the number of tasks $N$, the number of critical sections in a task and the critical section ratio $CSR$ (defined as the total length of the critical sections of a task over its WCET). Here, the degree of resource contention among the tasks in the system is determined by $R$, $CSR$ as well as the number of critical sections per task $n^{cs}_i$: a smaller $R$, a larger $CSR$ and a larger number of critical sections mean higher resource contention and more complex synchronization requirements among tasks, and vice versa.

For a given set of $M$, $NSRU$, $N$, $CSR$ and number of critical sections per task, following steps similar to those in [24], the synthetic task sets are generated as follows. The