Energy-aware checkpointing of divisible tasks with soft or hard deadlines

Similar documents
Resilient and energy-aware algorithms

Energy-efficient scheduling

Energy-aware checkpointing of divisible tasks with soft or hard deadlines

On scheduling the checkpoints of exascale applications

Failure Tolerance of Multicore Real-Time Systems scheduled by a Pfair Algorithm

Reclaiming the energy of a schedule: models and algorithms

Optimal checkpointing periods with fail-stop and silent errors

Checkpoint Scheduling

Scheduling divisible loads with return messages on heterogeneous master-worker platforms

Tradeoff between Reliability and Power Management

How to deal with uncertainties and dynamicity?

Approximation algorithms for energy, reliability and makespan optimization problems

Embedded Systems Design: Optimization Challenges. Paul Pop Embedded Systems Lab (ESLAB) Linköping University, Sweden

Efficient checkpoint/verification patterns

Scheduling divisible loads with return messages on heterogeneous master-worker platforms

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

Multiprocessor Scheduling I: Partitioned Scheduling. LS 12, TU Dortmund

Scheduling of Frame-based Embedded Systems with Rechargeable Batteries

Divisible Load Scheduling

Lecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013

Real-time Scheduling of Periodic Tasks (1) Advanced Operating Systems Lecture 2

TOWARD THE PLACEMENT OF POWER MANAGEMENT POINTS IN REAL-TIME APPLICATIONS

On the Complexity of Mapping Pipelined Filtering Services on Heterogeneous Platforms

Energy-aware scheduling under reliability and makespan constraints

Online Work Maximization under a Peak Temperature Constraint

Efficient checkpoint/verification patterns for silent error detection

Scheduling algorithms for heterogeneous and failure-prone platforms

Real-Time and Embedded Systems (M) Lecture 5

Embedded Systems Development

Mapping tightly-coupled applications on volatile resources

SPEED SCALING FOR ENERGY AWARE PROCESSOR SCHEDULING: ALGORITHMS AND ANALYSIS

Energy-efficient Mapping of Big Data Workflows under Deadline Constraints

Polynomial Time Algorithms for Minimum Energy Scheduling

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Enhancing Multicore Reliability Through Wear Compensation in Online Assignment and Scheduling. Tam Chantem Electrical & Computer Engineering

Checkpointing strategies for parallel jobs

Continuous heat flow analysis. Time-variant heat sources. Embedded Systems Laboratory (ESL) Institute of EE, Faculty of Engineering

CIS 4930/6930: Principles of Cyber-Physical Systems

Fine Grain Quality Management

Maximizing Rewards for Real-Time Applications with Energy Constraints

Algorithms for Temperature-Aware Task Scheduling in Microprocessor Systems

Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems

ENERGY-EFFICIENT REAL-TIME SCHEDULING ALGORITHM FOR FAULT-TOLERANT AUTONOMOUS SYSTEMS

Spatio-Temporal Thermal-Aware Scheduling for Homogeneous High-Performance Computing Datacenters

Scheduling Algorithms for Multiprogramming in a Hard Realtime Environment

Lecture 6. Real-Time Systems. Dynamic Priority Scheduling

CycleTandem: Energy-Saving Scheduling for Real-Time Systems with Hardware Accelerators

Scheduling Lecture 1: Scheduling on One Machine

Reliability-aware Thermal Management for Hard Real-time Applications on Multi-core Processors

Embedded Systems 15. REVIEW: Aperiodic scheduling. C i J i 0 a i s i f i d i

Scheduling. Uwe R. Zimmer & Alistair Rendell The Australian National University

Checkpointing strategies with prediction windows

On the Optimal Recovery Threshold of Coded Matrix Multiplication

Impact of fault prediction on checkpointing strategies

Real-time Systems: Scheduling Periodic Tasks

There are three priority driven approaches that we will look at

Reclaiming the energy of a schedule: models and algorithms

Efficient Power Management Schemes for Dual-Processor Fault-Tolerant Systems

Real-time Scheduling of Periodic Tasks (2) Advanced Operating Systems Lecture 3

Load Regulating Algorithm for Static-Priority Task Scheduling on Multiprocessors

Non-Preemptive and Limited Preemptive Scheduling. LS 12, TU Dortmund

Fault Tolerate Linear Algebra: Survive Fail-Stop Failures without Checkpointing

Embedded Systems 14. Overview of embedded systems design

Fault Tolerance. Dealing with Faults

Scheduling computational workflows on failure-prone platforms

STATISTICAL PERFORMANCE

Thermal Scheduling SImulator for Chip Multiprocessors

Optimizing Fault Tolerance for Real-Time Systems

Segment-Fixed Priority Scheduling for Self-Suspending Real-Time Tasks

Power-aware Manhattan routing on chip multiprocessors

Optimal DPM and DVFS for Frame-Based Real-Time Systems

Networked Embedded Systems WS 2016/17

EDF Feasibility and Hardware Accelerators

Reliability-Aware Power Management for Real-Time Embedded Systems. 3 5 Real-Time Tasks... Dynamic Voltage and Frequency Scaling...

Multiprocessor Scheduling II: Global Scheduling. LS 12, TU Dortmund

On the performance of greedy algorithms for energy minimization

Schedulability analysis of global Deadline-Monotonic scheduling

Static Load-Balancing Techniques for Iterative Computations on Heterogeneous Clusters

Checkpointing strategies with prediction windows

A Dynamic Real-time Scheduling Algorithm for Reduced Energy Consumption

Multiprocessor Scheduling of Age Constraint Processes

Reliability-aware energy optimization for throughput-constrained applications on MPSoC

Priority-driven Scheduling of Periodic Tasks (1) Advanced Operating Systems (M) Lecture 4

Upper and Lower Bounds on the Number of Faults. a System Can Withstand Without Repairs. Cambridge, MA 02139

Shared Recovery for Energy Efficiency and Reliability Enhancements in Real-Time Applications with Precedence Constraints

AS computer hardware technology advances, both

Module 5: CPU Scheduling

Chapter 6: CPU Scheduling

CSE140L: Components and Design Techniques for Digital Systems Lab. Power Consumption in Digital Circuits. Pietro Mercati

Adaptive and Power-Aware Resilience for Extreme-scale Computing

Scheduling Lecture 1: Scheduling on One Machine

Design of Real-Time Software

Saving Energy in the LU Factorization with Partial Pivoting on Multi-Core Processors

CPU SCHEDULING RONG ZHENG

Scheduling Parallel Iterative Applications on Volatile Resources

Chap 4. Software Reliability

Multiprocessor Energy-Efficient Scheduling for Real-Time Tasks with Different Power Characteristics

Power Minimization for Parallel Real-Time Systems with Malleable Jobs and Homogeneous Frequencies

Speed Scaling to Manage Temperature

Combining Shared Coin Algorithms

Transcription:

Energy-aware checkpointing of divisible tasks with soft or hard deadlines Guillaume Aupy 1, Anne Benoit 1,2, Rami Melhem 3, Paul Renaud-Goud 1 and Yves Robert 1,2,4 1. Ecole Normale Supérieure de Lyon, France 2. Institut Universitaire de France 3. University of Pittsburgh, USA 4. University of Tennessee Knoxville, USA Anne.Benoit@ens-lyon.fr http://graal.ens-lyon.fr/~abenoit/ International Green Computing Conference 2013 Arlington, USA Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 1/ 25

Divisible load scheduling and resilience Divisible load scheduling: divide a computational workload into chunks Arbitrary number of chunks Size of chunks freely chosen by user Goal: minimize makespan, i.e., total execution time Current platforms: increasing frequency of failures Well-established method to deal with failures: checkpointing Take a checkpoint at the end of each chunk and verify result Re-execution in case of transient failure Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 2/ 25

Divisible load scheduling and resilience Divisible load scheduling: divide a computational workload into chunks Arbitrary number of chunks Size of chunks freely chosen by user Goal: minimize makespan, i.e., total execution time Current platforms: increasing frequency of failures Well-established method to deal with failures: checkpointing Take a checkpoint at the end of each chunk and verify result Re-execution in case of transient failure Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 2/ 25

Energy: a crucial issue IGCC: Green Computing Conference! Real need to reduce energy dissipation in current processors Processor running at speed s: power s 3 watts Dynamic voltage and frequency scaling techniques (DVFS) Our goal: minimize energy consumption including that of checkpointing and re-execution (if failure) while enforcing a bound on execution time Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 3/ 25

Energy: a crucial issue IGCC: Green Computing Conference! Real need to reduce energy dissipation in current processors Processor running at speed s: power s 3 watts Dynamic voltage and frequency scaling techniques (DVFS) Our goal: minimize energy consumption including that of checkpointing and re-execution (if failure) while enforcing a bound on execution time Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 3/ 25

Outline 1 Framework 2 With a single chunk 3 With several chunks 4 Simulation results 5 Conclusion Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 4/ 25

Framework Execution of a divisible task (W operations) Failures may occur Transient faults Resilience through checkpointing Objective: minimize expected energy consumption E(E), given a deadline bound D Probabilistic nature of failure hits: expectation of energy consumption is natural (average cost over many executions) Deadline bound: two relevant scenarios (soft or hard deadline) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 5/ 25

Soft vs hard deadline Soft deadline: met in expectation, i.e., E(T ) D (average response time) Hard deadline: met in the worst case, i.e., T wc D VS Hard (worst-case) Soft (expected) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 6/ 25

Execution time, one single chunk One single chunk of size W Checkpoint overhead: execution time T C Instantaneous failure rate: λ First execution at speed s: T exec = W s + T C Failure probability: P fail = λt exec = λ( W s + T C ) In case of failure: re-execute at speed σ: T reexec = W σ And we assume success after re-execution + T C E(T ) = T exec + P fail T reexec = ( W s + T C ) + λ( W s + T C )( W σ + T C ) T wc = T exec + T reexec = ( W s + T C ) + ( W σ + T C ) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 7/ 25

Energy consumption, one single chunk One single chunk of size W Checkpoint overhead: energy consumption E C First execution at speed s: W s s 3 + E C = Ws 2 + E C Re-execution at speed σ: W σ 2 + E C, with probability P fail (Pfail = λt exec = λ( W s + T C ) ) E(E) = (Ws 2 + E C ) + λ ( W s + T C ) ( W σ 2 + E C ) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 8/ 25

Multiple chunks Execution times: sum of execution times for each chunk (worst-case or expected) Expected energy consumption: sum of expected energy for each chunk Coherent failure model: consider two chunks W 1 + W 2 = W Probability of failure for first chunk: Pfail 1 = λ( W 1 s + T C ) For second chunk: Pfail 2 = λ( W 2 s + T C ) With a single chunk of size W : P fail = λ( W s + T C ), differs from Pfail 1 + P2 fail only because of extra checkpoint Trade-off: many small chunks (more T C to pay, but small re-execution cost) vs few larger chunks (fewer T C, but increased re-execution cost) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 9/ 25

Optimization problem Decisions that should be taken before execution: Chunks: how many (n)? which sizes (W i for chunk i)? Speeds of each chunk: first run (s i )? re-execution (σ i )? Input: W, T C (checkpointing time), E C (energy spent for checkpointing), λ (instantaneous failure rate), D (deadline) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 10/ 25

Optimization problem Decisions that should be taken before execution: Chunks: how many (n)? which sizes (W i for chunk i)? Speeds of each chunk: first run (s i )? re-execution (σ i )? Input: W, T C (checkpointing time), E C (energy spent for checkpointing), λ (instantaneous failure rate), D (deadline) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 10/ 25

Optimization problem Decisions that should be taken before execution: Chunks: how many (n)? which sizes (W i for chunk i)? Speeds of each chunk: first run (s i )? re-execution (σ i )? Input: W, T C (checkpointing time), E C (energy spent for checkpointing), λ (instantaneous failure rate), D (deadline) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 10/ 25

Models Chunks Single chunk of size W Speed per chunk VS VS Multiple chunks (n and W i s) Single speed (s) Multiple speeds (s and σ) Deadline bound VS Hard (T wc D) Soft (E(T ) D) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 11/ 25

Outline 1 Framework 2 With a single chunk 3 With several chunks 4 Simulation results 5 Conclusion Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 12/ 25

Single chunk and single speed Consider first that s = σ (single speed): need to find optimal speed E(E) is a function of s: E(E)(s) = (Ws 2 + E C )(1 + λ( W s + T C )) Lemma: this function is convex and has a unique minimum s (function of λ, W, E c, T c ) s = λw 6(1+λT C ) ( (3 3 where a = λe C ( 2(1+λTC ) λw 27a 2 4a 27a+2) 1/3 2 1/3 (3 3 ) 2 E(T ) and T wc : decreasing functions of s ) 2 27a 1/3 1, 2 4a 27a+2) 1/3 Minimum speed s exp and s wc required to match deadline D (function of D, W, T c, and λ for s exp ) Optimal speed: maximum between s and s exp or s wc Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 13/ 25

Single chunk and multiple speeds Consider now that s σ (multiple speeds): two unknowns E(E) is a function of s and σ: E(E)(s, σ) = (Ws 2 + E C ) + λ( W s + T C )(W σ 2 + E C ) Lemma: energy minimized when deadline tight (both for wc and exp) σ expressed as a function of s: σ exp = λw D, (1+λT W C ) σwc = W (D 2T C )s W s +T s C Minimization of single-variable function, can be solved numerically (no expression of optimal s) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 14/ 25

Outline 1 Framework 2 With a single chunk 3 With several chunks 4 Simulation results 5 Conclusion Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 15/ 25

General problem with multiple chunks Divisible task of size W Split into n chunks of size W i : n i=1 W i = W Chunk i is executed once at speed s i, and re-executed (if necessary) at speed σ i Unknowns: n, W i, s i, σ i n ( E(E) = Wi si 2 ) n + E C + λ i=1 i=1 ( Wi s i + T C ) (Wi σ 2 i + E C ) Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 16/ 25

Multiple chunks and single speed With a single speed, σ i = s i for each chunk Theorem: in optimal solution, n equal-sized chunks (W i = W n ), executed at same speed s i = s Proof by contradiction: consider two chunks W 1 and W 2 executed at speed s 1 and s 2, with either s 1 s 2, or s 1 = s 2 and W 1 W 2 Strictly better solution with two chunks of size w = (W 1 + W 2 )/2 and same speed s Only two unknowns, s and n Minimum speed with n chunks: s exp (n) = W 1 + 2λT C + 4 λd + 1 n 2(D nt C (1 + λt C )) Minimization of double-variable function, can be solved numerically both for expected and hard deadline Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 17/ 25

Multiple chunks and multiple speeds Need to find n, W i, s i, σ i With expected deadline: All re-execution speeds are equal (σ i = σ) and tight deadline All chunks have same size and are executed at same speed WIth hard deadline: If s i = s and σ i = σ, then all W i s are equal Conjecture: equal-sized chunks, same first-execution / re-execution speeds σ as a function of s, bound on s given n Minimization of double-variable function, can be solved numerically Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 18/ 25

Outline 1 Framework 2 With a single chunk 3 With several chunks 4 Simulation results 5 Conclusion Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 19/ 25

Simulation settings Large set of simulations: illustrate differences between models Maple software to solve problems We plot relative energy consumption as a function of λ The lower the better Given a deadline constraint (hard or expected), normalize with the result of single-chunk single-speed Impact of the constraint: normalize expected deadline with hard deadline Parameters varying within large ranges Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 20/ 25

Comparison with single-chunk single-speed Results identical for any value of W /D E 1.00 0.75 0.50 0.25 SCMSed MCSSed MCMSed Model (/SCSS) SCMShd MCSShd MCMShd 1e 06 1e 03 1e+00 lambda For expected deadline, with small λ (< 10 2 ), using multiple chunks or multiple speeds do not improve energy ratio: re-execution term negligible; increasing λ: improvement with multiple chunks For hard deadline, better to run at high speed during second execution: use multiple speeds; use multiple chunks if frequent failures Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 21/ 25

Expected vs hard deadline constraint E 1.00 0.75 0.50 0.25 Model SCSS MCSS SCMS MCMS 1e 06 1e 03 1e+00 lambda Important differences for single speed models, confirming previous conclusions: with hard deadline, use multiple speeds Multiple speeds: no difference for small λ: re-execution at maximum speed has little impact on expected energy consumption; increasing λ: more impact of re-execution, and expected deadline may use slower re-execution speed, hence reducing energy consumption Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 22/ 25

Outline 1 Framework 2 With a single chunk 3 With several chunks 4 Simulation results 5 Conclusion Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 23/ 25

Conclusion Energy consumption of a divisible load workload on volatile platforms Soft or hard deadline constraint Theoretical side: Formal models for the problem Expression of solutions as functions to minimize With multiple chunks, use same size chunks, same speed, and same re-execution speed (conjecture for multiple-speed hard-deadline) Simulations: Single-chunk single-speed is very good for expected deadline Hard deadline and small λ: use multiple speeds Large values of λ: use multiple speeds and multiple chunks Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 24/ 25

What we had: Energy-aware checkpointing + frequency scaling What we aim at: Anne.Benoit@ens-lyon.fr IGCC 2013 Energy-aware checkpointing 25/ 25