Resilient and energy-aware algorithms

Size: px

Start display at page:

Download "Resilient and energy-aware algorithms"

Joel Chandler
5 years ago
Views:

1 Resilient and energy-aware algorithms Anne Benoit ENS Lyon CR /2017 CR02 Resilient and energy-aware algorithms 1/ 65

2 Motivation Need of resilient algorithms (see classes 3-7) Need of energy-aware algorithms (see classes 8-9)... And need to combine both! DVFS has an impact on resilience, so both problematics are linked... Also, does energy have an impact on the optimal checkpointing period? CR02 Resilient and energy-aware algorithms 2/ 65

3 Motivation Need of resilient algorithms (see classes 3-7) Need of energy-aware algorithms (see classes 8-9)... And need to combine both! DVFS has an impact on resilience, so both problematics are linked... Also, does energy have an impact on the optimal checkpointing period? CR02 Resilient and energy-aware algorithms 2/ 65

4 Outline 1 Optimal checkpointing period: time vs energy Framework Optimal period for execution time Optimal period for energy Experiments 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 3/ 65

5 Energy: a crucial issue Data centers ( Cloud Begins with Coal, M. Mills) TWh in 2013 consumption of Turkey (242), Spain (267), or Italy (309) 530Mt of CO 2 (carbontrust) > Canada crucial for both environmental and economical reasons Coordinated periodic checkpointing: what is the optimal checkpointing period if you optimize for Energy consumption? Is there a tradeoff between optimizing for Energy and optimizing for Time? Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 4/ 65

6 Outline 1 Optimal checkpointing period: time vs energy Framework Optimal period for execution time Optimal period for energy Experiments 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 5/ 65

7 Power model P Static : base power (platform switched on) Trend: goes down (w.r.t. other powers) P Cal : overhead due to CPU (computations) P I/O : overhead due to file I/O (checkpoint or recovery) P Down : overhead when one machine is down (rebooting) Meneses, Sarood and Kalé: Base power L = P Static E. Meneses, O. Sarood, and L.V. Kalé, Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems, in Proceedings of the 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2012), New York, USA, October Maximum power H = P Static + P Cal P I/O = 0 (and P Down = 0) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 6/ 65

8 Coordinated checkpointing Periodic checkpointing policy of period T Independent and identically distributed failures Applies to a single processor with MTBF µ = µ ind Applies to a platform with p processors with MTBF µ = µ ind p tightly-coupled application progress all processors available Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 7/ 65

9 Cost of checkpointing Time spent working Time spent checkpointing Time spent working with slowdown Time Computing the first chunk Checkpointing the first chunk Processing the first chunk General model: while a checkpoint is taken, computations are slowed-down: during a checkpoint of duration C, the same amount of computation is done as during a time ωc without checkpointing (0 ω 1). Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 8/ 65

10 Outline 1 Optimal checkpointing period: time vs energy Framework Optimal period for execution time Optimal period for energy Experiments 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 9/ 65

11 Expected execution time T base execution time without any overhead T final = T ff + T fails total execution time Time for fault-free execution Time lost due to failures T T ff = T base T (1 ω)c T fails = T final (D + R + Re-Exec) µ Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 10/ 65

12 AlgoT: Strategy with T opt Time Computation of the optimal checkpointing period in the non-blocking case: See Course 4, Section 4: Assessing protocols at scale (with α instead of ω) T opt Time = 2(1 ω)c(µ (D + R + ωc)) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 11/ 65

13 Outline 1 Optimal checkpointing period: time vs energy Framework Optimal period for execution time Optimal period for energy Experiments 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 12/ 65

14 Consumed energy E final = T Cal P Cal + T I/O P I/O + T Down P Down + T final P Static ( = T base + T final (ωc + T 2 C 2 + ωc 2 )) P Cal µ 2T 2T ( Tfinal + (R + C 2 ) ) T base + C P µ 2T T (1 ω)c I/O + T final µ DP Down + T final P Static T final T Cal + T I/O + T Down, unless ω = 0 CPU and I/O activities are overlapped (and both consumed) when checkpointing Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 13/ 65

15 AlgoE: Strategy with T opt Energy P Cal = αp Static, P I/O = βp Static, P Down = γp Static (T a) 2( b 2µ T ) 2 E P Static T base final = ab+t 2 ( ) 2µ (αωc +βr +γd +µ)+ µ αt 2 + α(1 ω)c2 + βc2 2T 2T + (T a)(b 2µ T ) ( α+ α(1 ω)c2 βc 2 ) ( ) βc b 2µ T 2µ T 2 ( ) = T 3( ) 1 4µ 4µ 1 +T 2 αωc+βr+γd 2µ 2µ βc 2µ 4µ ( ) +T 2µ ab 2µ ab + βcb (α(1 ω) β)c2 2 µ 4µ 2 βcb 2 ( ) ab(αωc+βr+γd+µ) b µ 2µ a 4µ 2 (α(1 ω) β)c 2 ( ) + T 1 (α(1 ω) β) 2µ C (α(1 ω) β) C ) 2µ 2µ 2 + 2µ b + a βc 4µ 2 + 2µ 1 ( ) +T = T 2 ( αωc+βr+γd 2µ 2 + b+ a (βc a)b 2 (α(1 ω) β)c2 µ 4µ 2 ab(αωc+βr+γd+µ) βcb 2 ( µ ) + b 2µ + a 4µ 2 (α(1 ω) β)c 2. Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 14/ 65

16 AlgoE: Strategy with T opt Energy P Cal = αp Static, P I/O = βp Static, P Down = γp Static (T a) 2( b 2µ T ) 2 E P Static T base final = ab+t 2 ( ) 2µ (αωc +βr +γd +µ)+ µ αt 2 + α(1 ω)c2 + βc2 2T 2T + (T a)(b 2µ T ) ( α+ α(1 ω)c2 βc 2 ) ( ) βc b 2µ T 2µ T 2 ( ) = T 3( ) 1 4µ 4µ 1 +T 2 αωc+βr+γd 2µ 2µ βc 2µ 4µ ( ) +T 2µ ab 2µ ab + βcb (α(1 ω) β)c2 2 µ 4µ 2 βcb 2 ( ) ab(αωc+βr+γd+µ) b µ 2µ a 4µ 2 (α(1 ω) β)c 2 ( ) + T 1 (α(1 ω) β) 2µ C (α(1 ω) β) C ) 2µ 2µ 2 + 2µ b + a βc 4µ 2 + 2µ 1 ( ) +T = T 2 ( αωc+βr+γd 2µ 2 + b+ a (βc a)b 2 (α(1 ω) β)c2 µ 4µ 2 ab(αωc+βr+γd+µ) βcb 2 ( µ ) + b 2µ + T a opt 4µ 2 (α(1 ω) β)c 2. We let Maple compute Energy Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 14/ 65

17 Outline 1 Optimal checkpointing period: time vs energy Framework Optimal period for execution time Optimal period for energy Experiments 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 15/ 65

18 Parameters: power ρ = P Static + P I/O P Static + P Cal = 1 + β 1 + α 20 Mega-watts for Exascale platform with 10 6 nodes Nominal power = 20 milli-watts per node 1/2 1/4 of that power in static consumption I/O an order of magnitude more than computing (J. Shalf, S. Dosanjh, and J. Morrison, Exascale computing technology challenges, in the 9th Int. Conf. High Performance Computing for Computational Science, 2011) Scenario 1: P Static = 10, P Cal = 10, P I/O = 100 ρ = 5.5 Scenario 2: P Static = 5, P Cal = 10, P I/O = 100 ρ = 7 Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 16/ 65

19 Parameters: resilience MTBF N = 45, 208 processors: one fault per day Individual (processor) MTBF µ ind 125 years. Total number of processors N: from N = 219, 150 to N = 2, 191, 500 µ = 300 min down to µ = 30 min C = R = 10 min, D = 1 min, and ω = 1/2. Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 17/ 65

20 Impact of ratio ρ T final (T energy )/ E final (T time )/ T final (T energy )/ E final (T time )/ T final (T time ) E final (T energy ) T final (T time ) E final (T energy ) ρ (µ=300) (µ=120) (µ=30) ρ ρ (µ=300) How much slower, (µ=120) if we optimize for energy instead of optimizing (µ=30) for time Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 18/ 65

21 T final (T energy )/ Impact of1.12 ratio ρ E final (T time )/ T final (T time ) E final (T energy ) (µ=300) (µ=120) (µ=30) ρ How much more energy consumption, if we optimize for time instead of optimizing for energy Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 19/ 65

22 AlgoT over AlgoE 1.25 How much slower, if we optimize for energy instead of optimizing for time How much more energy consumption, if we optimize for time instead of optimizing for energy ρ ρ µ µ Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 20/ 65

23 Scalability (ρ = 5.5) T final (T energy )/ E final (T time )/ T final (T time ) E final (T energy ) Number of nodes (ρ=5.5) (ρ=5.5) Number of nodes µ = 120 min for 10 6 nodes, C = R = 1 min, D = 0.1 min, ω = 1/2 Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 21/ 65

24 Scalability (ρ = 7) T final (T energy )/ E final (T time )/ T final (T time ) E final (T energy ) Number of nodes (ρ=7) (ρ=7) Number of nodes µ = 120 min for 10 6 nodes, C = R = 1 min, D = 0.1 min, ω = 1/2 Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 22/ 65

25 Conclusion Coordinated checkpointing, non-blocking Different optimal periods for time and energy Save more than 20% of energy with 10% increase in time Expect more gains for large-scale platforms Variety of resilience and power consumption parameters Quite flexible analytical model Easy to instantiate for other scenarios CR02 Resilient and energy-aware algorithms 23/ 65

26 Conclusion Coordinated checkpointing, non-blocking Different optimal periods for time and energy Save more than 20% of energy with 10% increase in time Expect more gains for large-scale platforms Variety of resilience and power consumption parameters Quite flexible analytical model Easy to instantiate for other scenarios CR02 Resilient and energy-aware algorithms 23/ 65

27 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption Model for one single chunk Model for multiple chunks and optimization problem Solving the problems Simulations 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 24/ 65

28 Framework Execution of a divisible task (W operations) Failures may occur Transient failures Resilience through checkpointing Objective: minimize expected energy consumption E(E), given a deadline bound D Probabilistic nature of failure hits: expectation of energy consumption is natural (average cost over many executions) Deadline bound: two relevant scenarios (soft or hard deadline) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 25/ 65

29 Soft vs hard deadline Soft deadline: met in expectation, i.e., E(T ) D (average response time) Hard deadline: met in the worst case, i.e., T wc D VS Hard (worst-case) Soft (expected) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 26/ 65

30 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption Model for one single chunk Model for multiple chunks and optimization problem Solving the problems Simulations 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 27/ 65

31 Execution time, one single chunk One single chunk of size W Checkpoint overhead: execution time T C Instantaneous failure rate: λ First execution at speed s: T exec = W s + T C Failure probability: P fail = λt exec = λ( W s + T C ) In case of failure: re-execute at speed σ: T reexec = W σ And we assume success after re-execution + T C E(T ) = T exec + P fail T reexec = ( W s + T C ) + λ( W s + T C )( W σ + T C ) T wc = T exec + T reexec = ( W s + T C ) + ( W σ + T C ) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 28/ 65

32 Energy consumption, one single chunk One single chunk of size W Checkpoint overhead: energy consumption E C First execution at speed s: W s s 3 + E C = Ws 2 + E C Re-execution at speed σ: W σ 2 + E C, with probability P fail (Pfail = λt exec = λ( W s + T C ) ) E(E) = (Ws 2 + E C ) + λ ( W s + T C ) ( W σ 2 + E C ) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 29/ 65

33 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption Model for one single chunk Model for multiple chunks and optimization problem Solving the problems Simulations 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 30/ 65

34 Multiple chunks Execution times: sum of execution times for each chunk (worst-case or expected) Expected energy consumption: sum of expected energy for each chunk Coherent failure model: consider two chunks W 1 + W 2 = W Probability of failure for first chunk: Pfail 1 = λ( W 1 s + T C ) For second chunk: Pfail 2 = λ( W 2 s + T C ) With a single chunk of size W : P fail = λ( W s + T C ), differs from Pfail 1 + P2 fail only because of extra checkpoint Trade-off: many small chunks (more T C to pay, but small re-execution cost) vs few larger chunks (fewer T C, but increased re-execution cost) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 31/ 65

35 Optimization problem Decisions that should be taken before execution: Chunks: how many (n)? which sizes (W i for chunk i)? Speeds of each chunk: first run (s i )? re-execution (σ i )? Input: W, T C (checkpointing time), E C (energy spent for checkpointing), λ (instantaneous failure rate), D (deadline) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 32/ 65

36 Optimization problem Decisions that should be taken before execution: Chunks: how many (n)? which sizes (W i for chunk i)? Speeds of each chunk: first run (s i )? re-execution (σ i )? Input: W, T C (checkpointing time), E C (energy spent for checkpointing), λ (instantaneous failure rate), D (deadline) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 32/ 65

37 Optimization problem Decisions that should be taken before execution: Chunks: how many (n)? which sizes (W i for chunk i)? Speeds of each chunk: first run (s i )? re-execution (σ i )? Input: W, T C (checkpointing time), E C (energy spent for checkpointing), λ (instantaneous failure rate), D (deadline) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 32/ 65

38 Models Chunks Single chunk of size W Speed per chunk VS VS Multiple chunks (n and W i s) Single speed (s) Multiple speeds (s and σ) Deadline bound VS Hard (T wc D) Soft (E(T ) D) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 33/ 65

39 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption Model for one single chunk Model for multiple chunks and optimization problem Solving the problems Simulations 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 34/ 65

40 Single chunk and single speed Consider first that s = σ (single speed): need to find optimal speed E(E) is a function of s: E(E)(s) = (Ws 2 + E C )(1 + λ( W s + T C )) Lemma: this function is convex and has a unique minimum s (function of λ, W, E c, T c ) s = λw 6(1+λT C ) ( (3 3 where a = λe C ( 2(1+λTC ) λw 27a 2 4a 27a+2) 1/3 2 1/3 (3 3 ) 2 E(T ) and T wc : decreasing functions of s ) 2 27a 1/3 1, 2 4a 27a+2) 1/3 Minimum speed s exp and s wc required to match deadline D (function of D, W, T c, and λ for s exp ) Optimal speed: maximum between s and s exp or s wc Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 35/ 65

41 Single chunk and multiple speeds Consider now that s σ (multiple speeds): two unknowns E(E) is a function of s and σ: E(E)(s, σ) = (Ws 2 + E C ) + λ( W s + T C )(W σ 2 + E C ) Lemma: energy minimized when deadline tight (both for wc and exp) σ expressed as a function of s: σ exp = λw D, (1+λT W C ) σwc = W (D 2T C )s W s +T s C Minimization of single-variable function, can be solved numerically (no expression of optimal s) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 36/ 65

42 General problem with multiple chunks Divisible task of size W Split into n chunks of size W i : n i=1 W i = W Chunk i is executed once at speed s i, and re-executed (if necessary) at speed σ i Unknowns: n, W i, s i, σ i n ( E(E) = Wi si 2 ) n + E C + λ i=1 i=1 ( Wi s i + T C ) (Wi σ 2 i + E C ) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 37/ 65

43 Multiple chunks and single speed With a single speed, σ i = s i for each chunk Theorem: in optimal solution, n equal-sized chunks (W i = W n ), executed at same speed s i = s Proof by contradiction: consider two chunks W 1 and W 2 executed at speed s 1 and s 2, with either s 1 s 2, or s 1 = s 2 and W 1 W 2 Strictly better solution with two chunks of size w = (W 1 + W 2 )/2 and same speed s Only two unknowns, s and n Minimum speed with n chunks: s exp (n) = W 1 + 2λT C + 4 λd + 1 n 2(D nt C (1 + λt C )) Minimization of double-variable function, can be solved numerically both for expected and hard deadline Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 38/ 65

44 Multiple chunks and multiple speeds Need to find n, W i, s i, σ i With expected deadline: All re-execution speeds are equal (σ i = σ) and tight deadline All chunks have same size and are executed at same speed With hard deadline: If s i = s and σ i = σ, then all W i s are equal Conjecture: equal-sized chunks, same first-execution / re-execution speeds σ as a function of s, bound on s given n Minimization of double-variable function, can be solved numerically Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 39/ 65

45 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption Model for one single chunk Model for multiple chunks and optimization problem Solving the problems Simulations 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 40/ 65

46 Simulation settings Large set of simulations: illustrate differences between models Maple software to solve problems We plot relative energy consumption as a function of λ The lower the better Given a deadline constraint (hard or expected), normalize with the result of single-chunk single-speed Impact of the constraint: normalize expected deadline with hard deadline Parameters varying within large ranges Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 41/ 65

47 Comparison with single-chunk single-speed Results identical for any value of W /D E SCMSed MCSSed MCMSed Model (/SCSS) SCMShd MCSShd MCMShd 1e 06 1e 03 1e+00 lambda For expected deadline, with small λ (< 10 2 ), using multiple chunks or multiple speeds do not improve energy ratio: re-execution term negligible; increasing λ: improvement with multiple chunks For hard deadline, better to run at high speed during second execution: use multiple speeds; use multiple chunks if frequent failures Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 42/ 65

48 Expected vs hard deadline constraint E Model SCSS MCSS SCMS MCMS 1e 06 1e 03 1e+00 lambda Important differences for single speed models, confirming previous conclusions: with hard deadline, use multiple speeds Multiple speeds: no difference for small λ: re-execution at maximum speed has little impact on expected energy consumption; increasing λ: more impact of re-execution, and expected deadline may use slower re-execution speed, hence reducing energy consumption Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 43/ 65

49 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy Complexity results Heuristics Simulation results 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 44/ 65

50 Framework DAG: G = (V, E) n = V tasks T i of weight w i p identical processors fully connected DVFS, Continuous model: interval of available continuous speeds [s min, s max ] One speed per task Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 45/ 65

51 Makespan Execution time of T i at speed s i : d i = w i s i If T i is executed twice on the same processor at speeds s i and s i : d i = w i s i + w i s i Constraint on makespan: end of execution before deadline D (hard deadline constraint) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 46/ 65

52 Reliability Transient failure: local, no impact on the rest of the system Transient failure rate: Poisson distribution of parameter: λ(s) = λ 0 e d smax s smax s min. Reliability R i of task T i as a function of speed s i : R i (s i ) = e λ(s i )Exe(w i,s i ) = (1st order) 1 λ 0 e ds i w i s i Threshold reliability (and hence speed s rel ) R i (s rel 1) R i (s) s min s rel s max Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 47/ 65 s

53 Re-execution: a task is re-executed on the same processor, just after its first execution With two executions, reliability R i of task T i is: R i = 1 (1 R i (s i ))(1 R i (s i )) Constraint on reliability: Reliability: R i R i (s rel ), and at most one re-execution Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 48/ 65

54 Energy Energy to execute task T i once at speed s i : E i (s i ) = w i s 2 i Dynamic part of classical energy models With re-executions, it is natural to take the worst-case scenario: ( Energy : E i = w i s 2 i + s i 2 ) Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 49/ 65

55 Energy and reliability: set of possible speeds Energy w i s 2 i + w i s 2 i = 2E i (s i ) w i s 2 i = E i (s i ) E i (s rel ) s rel 2 s rel Speed Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 50/ 65

56 Tri-Crit-Cont Given G = (V, E) Find A schedule of the tasks A set of tasks I = {i T i is executed twice} Execution speed s i for each task T i Re-execution speed s i for each task in I such that i I w i (s 2 i + s 2 i ) + i / I w i s 2 i is minimized, while meeting reliability and deadline constraints Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 51/ 65

57 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy Complexity results Heuristics Simulation results 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 52/ 65

58 Complexity results One speed per task Re-execution at same speed as first execution, i.e., s i = s i Tri-Crit-Cont is NP-hard even for a linear chain, but not known to be in NP (because of continuous model) Polynomial-time solution for a fork Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 53/ 65

59 Complexity results with Vdd-Hopping Each task is computed using at most two different speeds Tri-Crit-Vdd is NP-complete even for a linear chain Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 54/ 65

60 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy Complexity results Heuristics Simulation results 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 55/ 65

61 Energy-reducing heuristics Two steps: Mapping (NP-hard) List scheduling Speed scaling + re-execution (NP-hard) Energy reducing The list scheduling heuristic maps tasks onto processors at speed s max, and we keep this mapping in step two Step two = slack reclamation: use of deceleration and re-execution Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 56/ 65

62 Deceleration and re-execution Deceleration: select a set of tasks that we execute at speed max max(s rel, s i=1..n t i max D ): slowest possible speed meeting both reliability and deadline constraints Re-execution: greedily select tasks for re-execution Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 57/ 65

63 Super-weight (SW) of a task SW: sum of the weights of the tasks (including T i ) whose execution interval is included into T i s execution interval SW of task slowed down = estimation of the total amount of work that can be slowed down together with that task p 4 T i p 3 p 2 p 1 s e time Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 58/ 65

64 Selected heuristics A.SUS-Crit: efficient on DAGs with low degree of parallelism max Set the speed of every task to max(s rel, s i=1..n t i max D ) Sort the tasks of every critical path according to their SW and try to re-execute them Sort all the tasks according to their weight and try to re-execute them B.SUS-Crit-Slow: good for highly parallel tasks: re-execute, then decelerate Sort the tasks of every critical path according to their SW and try to re-execute them. If not possible, then try to slow them down Sort all tasks according to their weight and try to re-execute them. If not possible, then try to slow them down Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 59/ 65

65 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy Complexity results Heuristics Simulation results 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 60/ 65

66 Results We compare the impact of: the number of processors p the ratio D of the deadline over the minimum deadline D min (given by the list-scheduling heuristic at speed s max ) on the output of each heuristic Results normalized by heuristic running each task at speed s max ; the lower the better Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 61/ 65

67 Results 1 1 A.SUS-Crit A.SUS-Crit 0.9 B.SUS-Crit-Slow 0.9 B.SUS-Crit-Slow Eg / Eg_fmax Eg / Eg_fmax Number of processors Number of processors With increasing p, D = 1.2 (left), D = 2.4 (right) A better when number of processors is small B better when number of processors is large Superiority of B for tight deadlines: decelerates critical tasks that cannot be re-executed Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 61/ 65

68 Summary Tri-criteria energy/makespan/reliability optimization problem Various theoretical results Two-step approach for polynomial-time heuristics: List-scheduling heuristic Energy-reducing heuristics Two complementary energy-reducing heuristics for Tri-Crit-Cont CR02 Resilient and energy-aware algorithms 62/ 65

69 Outline 1 Optimal checkpointing period: time vs energy 2 Checkpointing and energy consumption 3 Tri-criteria problem: execution time, reliability, energy 4 Conclusion Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 63/ 65

70 Conclusion Resilience and energy consumption are two of the main challenges for Exascale platforms Revisiting checkpointing techniques for reliability while minimizing energy consumption Tri-criteria heuristics aiming at minimizing the energy consumption, with re-execution to deal with reliability... Still a lot of challenging algorithmic problems on these hot topics Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 64/ 65

71 Conclusion Resilience and energy consumption are two of the main challenges for Exascale platforms Revisiting checkpointing techniques for reliability while minimizing energy consumption Tri-criteria heuristics aiming at minimizing the energy consumption, with re-execution to deal with reliability... Still a lot of challenging algorithmic problems on these hot topics Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 64/ 65

72 Bibliography Optimal checkpointing period: time vs energy (Aupy, Benoit, Hérault, Robert, Dongarra, 2013) Energy-aware checkpointing of divisible tasks with soft or hard deadlines (Aupy, Benoit, Melhem, Renaud-Goud, Robert, 2013) Energy-aware scheduling under reliability and makespan constraints (Aupy, Benoit, Robert, 2012) CR02 Resilient and energy-aware algorithms 65/ 65

Energy-aware checkpointing of divisible tasks with soft or hard deadlines

Energy-aware checkpointing of divisible tasks with soft or hard deadlines Guillaume Aupy 1, Anne Benoit 1,2, Rami Melhem 3, Paul Renaud-Goud 1 and Yves Robert 1,2,4 1. Ecole Normale Supérieure de Lyon,