CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators

Similar documents
Segment-Fixed Priority Scheduling for Self-Suspending Real-Time Tasks

Real-Time Scheduling. Real Time Operating Systems and Middleware. Luca Abeni

Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems

Andrew Morton University of Waterloo Canada

Many suspensions, many problems: a review of self-suspending tasks in real-time systems

Task Models and Scheduling

Real-time Scheduling of Periodic Tasks (1) Advanced Operating Systems Lecture 2

Lecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013

Lecture 6. Real-Time Systems. Dynamic Priority Scheduling

Embedded Systems Development

Efficient Power Management Schemes for Dual-Processor Fault-Tolerant Systems

Non-Work-Conserving Non-Preemptive Scheduling: Motivations, Challenges, and Potential Solutions

Real-Time and Embedded Systems (M) Lecture 5

Real-Time Systems. Lecture #14. Risat Pathan. Department of Computer Science and Engineering Chalmers University of Technology

Embedded Systems 14. Overview of embedded systems design

On the Design and Application of Thermal Isolation Servers

Probabilistic Preemption Control using Frequency Scaling for Sporadic Real-time Tasks

Analytical Enhancements and Practical Insights for MPCP with Self-Suspensions. Pratyush Patel, Iljoo Baek, Hyoseung Kim*, Raj Rajkumar

Uniprocessor Mixed-Criticality Scheduling with Graceful Degradation by Completion Rate

The Concurrent Consideration of Uncertainty in WCETs and Processor Speeds in Mixed Criticality Systems

EDF Feasibility and Hardware Accelerators

Improved Priority Assignment for the Abort-and-Restart (AR) Model

Real-Time Scheduling and Resource Management

A Dynamic Real-time Scheduling Algorithm for Reduced Energy Consumption

Scheduling Slack Time in Fixed Priority Pre-emptive Systems

Schedulability analysis of global Deadline-Monotonic scheduling

Dependency Graph Approach for Multiprocessor Real-Time Synchronization. TU Dortmund, Germany

Non-Preemptive and Limited Preemptive Scheduling. LS 12, TU Dortmund

On the Energy-Aware Partitioning of Real-Time Tasks on Homogeneous Multi-Processor Systems

Schedulability and Optimization Analysis for Non-Preemptive Static Priority Scheduling Based on Task Utilization and Blocking Factors

Scheduling Periodic Real-Time Tasks on Uniprocessor Systems. LS 12, TU Dortmund

Analysis and Implementation of Global Preemptive Fixed-Priority Scheduling with Dynamic Cache Allocation*

Schedulability Analysis for the Abort-and-Restart Model

Energy-Efficient Scheduling for Hybrid Tasks in Control Devices for the Internet of Things

Efficient TDM-based Arbitration for Mixed-Criticality Systems on Multi-Cores

Real-Time Calculus. LS 12, TU Dortmund

Process Scheduling for RTS. RTS Scheduling Approach. Cyclic Executive Approach

Exact speedup factors and sub-optimality for non-preemptive scheduling

2.1 Task and Scheduling Model. 2.2 Definitions and Schedulability Guarantees

Embedded Systems Design: Optimization Challenges. Paul Pop Embedded Systems Lab (ESLAB) Linköping University, Sweden

Real-Time Systems. Event-Driven Scheduling

EDF Scheduling. Giuseppe Lipari May 11, Scuola Superiore Sant Anna Pisa

Energy-Constrained Scheduling for Weakly-Hard Real-Time Systems

There are three priority driven approaches that we will look at

Failure Tolerance of Multicore Real-Time Systems scheduled by a Pfair Algorithm

Bursty-Interference Analysis Techniques for. Analyzing Complex Real-Time Task Models

Slack Reclamation for Real-Time Task Scheduling over Dynamic Voltage Scaling Multiprocessors

The preemptive uniprocessor scheduling of mixed-criticality implicit-deadline sporadic task systems

ENERGY EFFICIENT TASK SCHEDULING OF SEND- RECEIVE TASK GRAPHS ON DISTRIBUTED MULTI- CORE PROCESSORS WITH SOFTWARE CONTROLLED DYNAMIC VOLTAGE SCALING

Real-Time Systems. Event-Driven Scheduling

Real-time operating systems course. 6 Definitions Non real-time scheduling algorithms Real-time scheduling algorithm

Aperiodic Task Scheduling

Paper Presentation. Amo Guangmo Tong. University of Taxes at Dallas January 24, 2014

Paper Presentation. Amo Guangmo Tong. University of Taxes at Dallas February 11, 2014

Energy-efficient scheduling

Optimal Utilization Bounds for the Fixed-priority Scheduling of Periodic Task Systems on Identical Multiprocessors. Sanjoy K.

Cache-Aware Compositional Analysis of Real- Time Multicore Virtualization Platforms

EDF Scheduling. Giuseppe Lipari CRIStAL - Université de Lille 1. October 4, 2015

Schedulability of Periodic and Sporadic Task Sets on Uniprocessor Systems

Shedding the Shackles of Time-Division Multiplexing

Reliability-Aware Power Management for Real-Time Embedded Systems. 3 5 Real-Time Tasks... Dynamic Voltage and Frequency Scaling...

An Energy-Efficient Semi-Partitioned Approach for Hard Real-Time Systems with Voltage and Frequency Islands

Schedulability Analysis of the Linux Push and Pull Scheduler with Arbitrary Processor Affinities

Clock-driven scheduling

Online Energy-Aware I/O Device Scheduling for Hard Real-Time Systems with Shared Resources

An Optimal k-exclusion Real-Time Locking Protocol Motivated by Multi-GPU Systems

Power-Aware Scheduling of Conditional Task Graphs in Real-Time Multiprocessor Systems

CIS 4930/6930: Principles of Cyber-Physical Systems

TEMPORAL WORKLOAD ANALYSIS AND ITS APPLICATION TO POWER-AWARE SCHEDULING

Scheduling Stochastically-Executing Soft Real-Time Tasks: A Multiprocessor Approach Without Worst-Case Execution Times

TDDB68 Concurrent programming and operating systems. Lecture: CPU Scheduling II

Generalized Network Flow Techniques for Dynamic Voltage Scaling in Hard Real-Time Systems

A 2-Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value

Scheduling Algorithms for Multiprogramming in a Hard Realtime Environment

Reservation-Based Federated Scheduling for Parallel Real-Time Tasks

Lecture Note #6: More on Task Scheduling EECS 571 Principles of Real-Time Embedded Systems Kang G. Shin EECS Department University of Michigan

CEC 450 Real-Time Systems

Design and Analysis of Time-Critical Systems Response-time Analysis with a Focus on Shared Resources

Load Regulating Algorithm for Static-Priority Task Scheduling on Multiprocessors

MIRROR: Symmetric Timing Analysis for Real-Time Tasks on Multicore Platforms with Shared Resources

Network Flow Techniques for Dynamic Voltage Scaling in Hard Real-Time Systems 1

Maximizing Rewards for Real-Time Applications with Energy Constraints

Networked Embedded Systems WS 2016/17

p-yds Algorithm: An Optimal Extension of YDS Algorithm to Minimize Expected Energy For Real-Time Jobs

Real Time Operating Systems

Mitra Nasri* Morteza Mohaqeqi Gerhard Fohler

Exploiting Precedence Relations in the Schedulability Analysis of Distributed Real-Time Systems

Criticality-Aware Partitioning for Multicore Mixed-Criticality Systems

Non-preemptive Fixed Priority Scheduling of Hard Real-Time Periodic Tasks

arxiv: v1 [cs.os] 6 Jun 2013

Component-Based Software Design

Multi-core Real-Time Scheduling for Generalized Parallel Task Models

RUN-TIME EFFICIENT FEASIBILITY ANALYSIS OF UNI-PROCESSOR SYSTEMS WITH STATIC PRIORITIES

Supplement of Improvement of Real-Time Multi-Core Schedulability with Forced Non- Preemption

CPU SCHEDULING RONG ZHENG

Scheduling. Uwe R. Zimmer & Alistair Rendell The Australian National University

Time and Schedulability Analysis of Stateflow Models

Deadline-driven scheduling

3. Scheduling issues. Common approaches 3. Common approaches 1. Preemption vs. non preemption. Common approaches 2. Further definitions

Real-time scheduling of sporadic task systems when the number of distinct task types is small

Transcription:

CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators Sandeep D souza and Ragunathan (Raj) Rajkumar Carnegie Mellon University

High (Energy) Cost of Accelerators Modern-day CPS o real-time decision making GP-GPUs, ASICs, FPGAs, DSPs o high computational power o with significant energy requirements Energy Budget is limited Focus on reducing hardware-accelerator power consumption 2

Outline Motivation Background & Related Work Schedulability Analysis Energy Management CycleSolo: Uniprocessor + Accelerator CycleTandem: Uniprocessor + Accelerator Multi-core Extensions Results Conclusion

System Model Sporadic Hard Real-Time Tasks τ i : C i, G i, D i, T i C i : CPU WCET of any job of task τ i G i : Accelerator WCET of any job of task τ i G i e : Accelerator Execution G i m : CPU intervention D i : Deadline T i : Period (Minimum inter-arrival time) Fixed-priority and Fully-Partitioned Multi-core Scheduling *Kim et al. 2017 Assumptions: Each task has a single accelerator segment, tasks selfsuspend on the CPU & the accelerator is non-preemptive 4

Related Work: Schedulability Analysis Lock-based Approach o lock to arbitrate accelerator access o [Elliott et al. 2012][Elliott et al. 2013] [Chen et al. 2016][Patel et al. 2018] Server-based Approach o server accesses accelerator on behalf of tasks o [Kim et al. 2017] Utilize the lock-based approach with MPCP [Rajkumar 1990] proposed techniques can be extended to the server-based approach 5

MPCP-based Analysis W 0 i = C i + G i + B i, W k+1 i = C i + G i + B i + σ i 1 h=1 I i,h B i : worst-case blocking, I i,h : worst-case preemption by τ h on τ i Blocking Calculation: No exact analysis (pessimistic) o request-driven approach uses number of accelerator requests when task blocks o job-driven approach uses number of jobs which arrive during task response time o hybrid analysis best of both worlds - with self-suspensions Single Accelerator segment per-task request-driven approach dominates the job-driven approach 6

CMOS Energy Model Energy Model: P total = P dynamic + P static Dynamic Switching Power o P dynamic = K C L V 2 dd f o reduced using Voltage and Frequency Scaling (VFS) Static Leakage Power o P static = V dd I leakage o reduced using low-power sleep states power gating and/or clock gating Most GPUs (and hardware accelerators) only expose VFS to users focus on reducing frequency to reduce power consumption 7

Related Work: RT Energy Management VFS-based: [Saewong and Rajkumar 03][Hakan and Yang 03] [Devadas and Aydin 10][Kandhalu et al. 11] Sleep-state based: [Chang, Gabow and Khuller 12][Rowe et al. 2010] [Fu et al. 15][Dsouza et al. 2016] GPU Energy Management o MERLOT [Santriaji and Hoffmann 2018] hardware approach, dynamically exploit GPU slack to reduce frequency does not consider schedulability, CPU+GPU execution Focus Required: on static analytical frequency framework scaling (taskset-wide to determine single energy-efficient frequency) CPU + hardware accelerator more frequencies, predictable while operation guaranteeing schedulability 8

Outline Motivation Background & Related Work CycleSolo: Uniprocessor CycleSolo Variants SysClock [Saewong et al. 2003] Ratchet Search CycleTandem: Uniprocessor Multi-core Extensions Results Conclusion

CycleSolo: Variants Uniprocessor + Hardware Accelerator combination o only one frequency can be set CycleSolo-CPU o only the CPU frequency can be scaled CycleSolo-Accelerator o only the accelerator frequency can be scaled CycleSolo-ID o the CPU & accelerator can only be scaled by a common factor To To minimize find theenergy smallest consumption frequency Find the Find smallest the maximum frequency slack which available guarantees in the schedule schedulability 10

CycleSolo: Impact of Blocking Tasks block while waiting to access the non-preemptive accelerator Undefined Critical Instant o Does not occur when all high-priority tasks come in together o Due to blocking and self-suspensions Existing analysis assume the same critical instant, add pessimism by: o considering the worst-case blocking o modeling self-suspensions as release jitter All analyses are pessimistic, none are exact We can only find the smallest frequency which guarantees schedulability, given a schedulability-analysis technique 11

CycleSolo: Impact of Self-Suspensions SysClock*: Independent Tasks, No blocking & self-suspensions o Calculate slack at each scheduling instant (job arrivals, deadlines) In the presence of self-suspensions o Slack calculation (frequency calculation) depends on interference and blocking response time and WCET of high-priority tasks^ operating frequency In the presence of self-suspension model the pessimism as effective scheduling instants ^Chen et al. 16 *Saewong & Rajkumar 03 12

Effective Scheduling Instants W k+1 i = C i + G i + B i + σ i 1 h=1 I i,h Interference of τ h on τ i : I i,h = α i,h E h o α i,h = ceil((w i + W h E h )/T h ) depends on high-priority response time & WCET For τ i, from an analysis perspective o It appears that τ h which self-suspends arrives at: o S h : = j T h W h E h j > 0} Self-Suspension W h E h τ h 1,5 Effective Scheduling Instant 0 5 10 Calculate Slack at all Effective Scheduling Instants 13

CycleSolo: Key Ideas Find a tight range f low, f high in which the best frequency f min lies Perform a binary search over the range o Yields the lowest frequency guaranteeing schedulability How it works: o Consider tasks in decreasing order of priority i i o For each τ i compute a range f low, f high containing the lowest frequency ensures that τ i and all its higher-priority tasks are schedulable Iteratively refine the range as we go through the tasks Ratchet Search 14

Ratchet Search Consider task τ i, initial range f i 1 i 1 low, f high i 1 o Assume high-priority tasks are running at f high minimizes interference, guarantees schedulability o Slack calculation yields frequency f i τ i schedulable with frequency f i when the high-priority interference is minimized 15

Ratchet Search: Range Update i 1 Case 1: f i < f low o At least one task misses deadlines at f i o f min i f i 1 i 1 low, f high No change to the range Case 2: f i 1 low < f i i 1 < f high o τ i misses deadlines at frequencies < f i o f min i f i i 1, f high Case 3: f i i 1 > f high Lower bound ratcheted up i 1 o τ i not schedulable at frequencies f high f max f high f f i high Frequency Range f i f low f i 0 f f low low o f min i f i 1 high, f i Lower & Upper bounds ratcheted up Range bounds always ratcheted up 16

CycleSolo: Putting it all together CycleSolo-CPU Example: τ 1 : C 1 = 10, G 1 = 8, T 1 = 50, τ 2 : C 2 = 20, G 2 = 5, T 2 = 80 Initial CPU Frequency Range set to: o f init init low = 0. 45, f high = 0. 45 (CPU Utilization) Consider effective scheduling points for τ 1 o τ 1 50 f high = 0. 45 CPU Frequency Range f max f low = 0. 45 Consider τ 1 α 50 1 = = 0.27 50 G 1 G 2 f min 1 = min α 50 1 = 0. 27 C 1 Case 1: f i < f low No Change CPU Frequency = 0.27 f 1 min = 0. 27 0 τ 1 10,8,50 τ 2 20,5,80 Accelerator 0 50 80 100 17

CycleSolo: Putting it all together CycleSolo-CPU Example: τ 1 : C 1 = 10, G 1 = 8, T 1 = 50, τ 2 : C 2 = 20, G 2 = 5, T 2 = 80 CPU Frequency Range set to: o 0 f low 0 = 0. 45, f high = 0. 45 Consider effective scheduling points for τ 2 o τ 2 42 and 80 Consider τ 2 α 80 1 = 2 C 1+C f 2 80 G 1cpu = 0.59 G 2 f min 1 = min α 80 1 = 0. 59 τ 1 10,8,50 τ 2 20,5,80 Accelerator solo [0. 45, 0. 59] Case No 3: Slack f i > in fthe high Binary Search f min schedule = Ratchet 0. 59 until Both t=53 Bounds > 42 Effective Scheduling Instant of τ 1 CPU Frequency = 0.59 f min high 1 = 0. 59 f high = 0. 45 Same Concepts Applicable for CycleSolo-Accelerator and CycleSolo-ID CPU Frequency Range f max 0 42 50 53 80 100 0 f low = 0. 45 18

Outline Motivation Background & Related Work CycleSolo: Uniprocessor CycleTandem: Uniprocessor See-Saw Theorem Slack Squeezing Multi-core Extensions Conclusion

CycleTandem Uniprocessor + Hardware Accelerator combination o each can have their own frequency Slack in the system is finite: o relationship between the CPU & accelerator frequency every CPU frequency min. accelerator frequency o guaranteeing schedulability o and vice versa To minimize energy consumption Find the frequency pair (f cpu, f acc ) which minimizes energy while guaranteeing schedulability 20

Feasible Frequencies Lowest Feasible CPU frequency o returned by CycleSolo-CPU: fsolo cpu Lowest Feasible Accelerator frequency o returned by CycleSolo-Acc: fsolo acc high Highest Useful CPU Frequency f cpu solo o Lowest CPU frequency corresponding to accelerator frequency f acc high Highest Useful Accelerator Frequency f acc solo o Lowest Accelerator frequency corresponding to CPU frequency f cpu Energy-Optimal CPU Frequency f opt cpu [f solo cpu, f high cpu ] Energy-Optimal Accelerator Frequency f opt acc [f solo acc, f acc CycleSolo helps bootstrap CycleTandem high ] 21

CycleTandem: See-Saw Theorem The CPU and accelerator frequencies cannot be o simultaneously lesser than f solo id (CycleSolo-ID) o to ensure schedulability f solo id : CycleSolo-ID Frequency o Lowest Common CPU and Accelerator Frequency CPU frequency increases o Accelerator frequency decreases o and vice versa Mapping between f cpu f acc (and vice versa) solo f id Find the optimal CPU frequency f cpu or Accelerator frequency f acc which minimizes energy search the feasible range 22

CPU efficient Accelerator efficient CycleTandem: Slack Squeezing For a given δ increase in CPU (accelerator) frequency f cpu (f acc ) if the accelerator (CPU) frequency decreases by δ > δ then the accelerator (CPU) o Squeezes slack more efficiently than the CPU (accelerator) Energy is a non-linear & non-convex function of the accelerator & CPU frequencies 23

CycleTandem: Key Idea Non-Linear Non-Convex Energy Function o difficult to obtain the best solution o search the feasible range using a heuristic How it works: o Compute the feasible CPU frequency range (CycleSolo-CPU) o Compute the feasible Accelerator frequency range (CycleSolo-Accelerator) o Search (heuristic or brute force) over the smaller of the two ranges Greedy-search heuristic: (1) Choose the endpoint with the lower energy, (2) increase/decrease the frequency till a local minimum is reached 24

Outline Motivation Background & Related Work CycleSolo: Uniprocessor CycleTandem: Uniprocessor Multi-core Extensions Assumptions Extensions SyncAware-WFD Results Conclusion

Multi-Core Assumptions all CPU cores are in the same power domain can only be set to the same frequency fully-partitioned scheduling If each core has its own frequency Response Time of task on a core can depend on frequency of another core impact of self suspensions on remote blocking 26

Multi-Core Extensions Key Observations: o RatchetSearch still applicable CycleSolo works o See-Saw Theorem still holds CycleTandem works Difference: o Interference only by tasks on the same core o Remote Blocking by tasks on other core CycleSolo & CycleTandem can be used for Multi-Core Processors, which have a single power domain across all CPU cores 27

The Impact of Task Partitioning All CPU cores can only be set to the same frequency o Load balancing is useful o Worst-Fit Decreasing (WFD) known to yield balanced partitions Blocking & Self-Suspensions o Effects can be felt by tasks on other cores Sync-Aware WFD: based on [Lakshmanan et al. 2011] o Load Balance while constraining self-suspending tasks to ψ = ceil γ acc m cores o γ acc is the fraction of the CPU load belonging to self-suspending tasks Sync-Aware WFD can restrict the schedulability pessimism of blocking and self-suspensions to a few cores 28

Outline Motivation Background & Related Work CycleSolo: Uniprocessor CycleTandem: Uniprocessor Multi-core Extensions Results Analytical Results TX2 Results Conclusion

Analytical Results Randomly-generated Tasksets: UUnifast-Discard Energy: Compared to without Energy Management o CycleTandem : up to 71.88% lower o CycleSolo-ID : up to 71.03% lower o CycleSolo-Accel : up to 63.41% lower o CycleSolo-CPU : up to 27.42% lower CycleTandem greedy-search heuristic vs brute force: o Worst-case 1.53% greater energy than brute force CycleSolo and CycleTandem can deliver significant energy savings 30

Multi-core Results: Partitioning 4-core processor considered CycleTandem: WFD vs SyncAware-WFD o Schedulability : SyncAware-WFD 6.3% more tasksets o Energy-Savings: SyncAware-WFD up to 3.3% greater SyncAware-WFD yields marginally better schedulability and energy savings than WFD, with the same algorithmic complexity 31

Real-Platform Evaluation: NVIDIA TX2 4-core ARM processor, 256-core Integrated GPU o CPU cores can only be set to the same frequency o GPU frequency can be set independently Energy: Compared to without Energy Management o CycleTandem : up to 44.29% lower 1.78x battery life o CycleSolo-ID : up to 44.18% lower o CycleSolo-Accel : up to 7.81% lower o CycleSolo-CPU : up to 32.01% lower Significant real-world energy savings are possible 32

Conclusion CycleSolo: Slowest Speed is Best o scenarios where only the CPU/accelerator/common frequency can be set o slack computation in the presence of blocking and self-suspensions CycleTandem: Better frequency pair, lower energy o CPU and accelerator frequency can be independently set o non-convex optimization problem: search the feasible range Multi-Core Extensions: o for scenarios where all CPU cores have the same frequency o CycleSolo & CycleTandem are still applicable Analysis framework for Energy-Saving Scheduling with Hardware Accelerators lays the groundwork for analyzing more complex scenarios 33

Thank You! Questions?