Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing

Size: px

Start display at page:

Download "Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing"

Irma Reynolds
5 years ago
Views:

1 Analysis of the Tradeoffs between Energy and Run Time for Multilevel Checkpointing Prasanna Balaprakash, Leonardo A. Bautista Gomez, Slim Bouguerra, Stefan M. Wild, Franck Cappello, and Paul D. Hovland ANL PMBS SC 14 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 1

Context and motivations Context: The Need For Speed Figure : From http://www.

2 Context and motivations Context: The Need For Speed Figure : From leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 2

3 Context and motivations Motivation: Failures Sequoia MTBF 1 day. Blue Waters 2 nodes failure per day. Titan MTBF < 1 day. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 3

4 Context and motivations Motivation: Failures Sequoia MTBF 1 day. Blue Waters 2 nodes failure per day. Titan MTBF < 1 day. 20 % of the computation is wasted in recovery and re-execution (Implies energy waste leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 3

5 Context and motivations Motivation: Failures Sequoia MTBF 1 day. Blue Waters 2 nodes failure per day. Titan MTBF < 1 day. 20 % of the computation is wasted in recovery and re-execution (Implies energy waste Exascale: The number of components for both memory and processors will increase by a factor of 100. Shrinking the circuit sizes and running at lower voltages, increases the SDC probability. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 3

6 Context and motivations Motivation: Failures Sequoia MTBF 1 day. Blue Waters 2 nodes failure per day. Titan MTBF < 1 day. 20 % of the computation is wasted in recovery and re-execution (Implies energy waste Exascale: The number of components for both memory and processors will increase by a factor of 100. Shrinking the circuit sizes and running at lower voltages, increases the SDC probability. In exascale failures will occur at higher frequency, optimistic MTBF is couple of hours. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 3

7 Context and motivations Motivation: Energy The power draw of the interconnect on Blue Gene/Q appears to be independent of load. CPU varies only by some 20% Power draw under different loads is DRAM change by a factor of 2 or more. Exascale: Data movement and IO will consume more than 70% of the total system power (most of the 20 MW will go just to power the 10 PB of total system memory. Flops/Watt VS Communication/Watts Avoid checkpointing and data movement do more re-computations. VS Avoid re-computations via checkpointing more often. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 4

8 Context and motivations Related Work ECOTFIT, Diouri et al Blocking checkpointing Message logging Conclusion no big tradeoff observed. Meneses et al Parallel recovery vs global recovery Used RALP API (No communication or IO are covered Parallel is better since it reduces the overall time Aupy et al Blocking vs no-blocking single level. No experiment. (ANL Tradeoffs between Energy and Run Time PMBS SC 14 5

9 Context and motivations Related Work ECOTFIT, Diouri et al Blocking checkpointing Message logging Conclusion no big tradeoff observed. Meneses et al Parallel recovery vs global recovery Used RALP API (No communication or IO are covered Parallel is better since it reduces the overall time Aupy et al Blocking vs no-blocking single level. No experiment. The missing episode What about multilevel checkpointing? (ANL Tradeoffs between Energy and Run Time PMBS SC 14 5

10 Problem formulation and notations 1 Context and motivations 2 Problem formulation and notations Multilevel checkpointing Energy model Multiobjective optimization 3 Simulation and experimentations Experimentations Tradeoff analysis 4 Conclusion and future work leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 6

11 Problem formulation and notations Multilevel checkpointing Multilevel Checkpointing Multiple levels of storage (DRAM, NVM, PFS. Coupled with data replication and erasure codes. Low levels offer high performance and partial reliability. High levels offer high reliability but impose large overhead. Different ckpt. levels have different frequencies. After a failure the application restart from the lowest available level. If unable to recover, try next level of checkpoint (Further in the past. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 7

12 Problem formulation and notations Multilevel checkpointing Wasted time model L levels of checkpoint (4 with FTI Checkpoint strategy: τ i, for i = 1 L Checkpoint cost: c i for level i r i time for a restart from level i d i downtime after a failure affecting level i. µ i rate of failures affecting level i. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 8

13 Problem formulation and notations Energy model Wasted energy model Pi c Pi r power for a level i checkpoint Watts. power for a restart from level i Watts. P a power for a failure-free computation without checkpointing Watts. µ i rate for failure affecting level i. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 9

14 Problem formulation and notations Multiobjective optimization Problem solving Checkpoint time W ch = L i=1 c i τ i i 1 + µ i τ i j=1 c j 2τ j leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 10

15 Problem formulation and notations Multiobjective optimization Problem solving Checkpoint time W ch = L i=1 c i τ i i 1 + µ i τ i j=1 c j 2τ j Rework time W rew = L i=1 µ i τ i 2 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 10

16 Problem formulation and notations Multiobjective optimization Problem solving Checkpoint time W ch = L i=1 c i τ i i 1 + µ i τ i j=1 c j 2τ j Rework time W rew = L i=1 µ i τ i 2 Downtime and restart time W down = L µ i (r i + d i i=1 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 10

17 Problem formulation and notations Multiobjective optimization Problem solving Checkpoint wasted energy E ch = L i=1 P c i c i i 1 Pj c + µ i τ c j i τ i 2τ j j=1 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 11

18 Problem formulation and notations Multiobjective optimization Problem solving Checkpoint wasted energy E ch = L i=1 P c i c i i 1 Pj c + µ i τ c j i τ i 2τ j j=1 Rework wasted energy E rew = L i=1 P a µ iτ i 2 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 11

19 Problem formulation and notations Multiobjective optimization Problem solving Checkpoint wasted energy E ch = L i=1 P c i c i i 1 Pj c + µ i τ c j i τ i 2τ j j=1 Rework wasted energy E rew = L i=1 P a µ iτ i 2 Downtime and restart wasted energy E down = L Pi r µ i (r i + d i i=1 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 11

20 Problem formulation and notations Multiobjective optimization Total wasted time W = L i=1 ( ( c i + µ iτ i 1 + τ i 2 i 1 j=1 c j 2τ j + µ i (r i + d i (1 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 12

21 Problem formulation and notations Multiobjective optimization Total wasted time W = L i=1 ( ( c i + µ iτ i 1 + τ i 2 i 1 j=1 c j 2τ j + µ i (r i + d i (1 Total wasted energy E = ( L P c i c i i=1 τ i ( + µ i τ P a i 2 + i 1 Pj cc j j=1 2τ j + L i=1 Pr i µ i(r i + d i, (2 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 12

22 Problem formulation and notations Multiobjective optimization First derivatives E τ i W τ i = µ i 2 = µ i 2 ( ( i j=1 i 1 P a + j=1 c j τ j P c j c j τ j c i τi 2 ( Pc i c i τ 2 i 1 + ( L j=i µ j τ j 2 L j=i+1 µ j τ j 2 (3 (4 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 13

23 Problem formulation and notations Multiobjective optimization First derivatives E τ i W τ i = µ i 2 = µ i 2 ( ( i j=1 i 1 P a + j=1 c j τ j P c j c j τ j c i τi 2 ( Pc i c i τ 2 i 1 + ( L j=i µ j τ j 2 L j=i+1 µ j τ j 2 (3 (4 Solutions τ W i = τ E i = c i(2 + L j=i+1 µ jτ W µ i (1 + i 1 c j j=1 τ W j j ρ ic i (2 + L j=i+1 µ jτ E µ i (1 + i 1 j=1 ρ j c j τj E j ρ i = P c i /Pa leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 13

24 Problem formulation and notations Multiobjective optimization Solutions ρ i = P c i /Pa τ W i = τ E i = c i(2 + L j=i+1 µ jτ W µ i (1 + i 1 c j j=1 τ W j j ρ ic i (2 + L j=i+1 µ jτ E µ i (1 + i 1 j=1 ρ j c j τj E j Solutions For one single level we have : τ W = 2c/µ and τ E = τ W P c /P a Whenever P c P a, we have that τ W τ E, and hence the two objectives are conflicting. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 14

25 Problem formulation and notations Multiobjective optimization Pareto front Definition τ i is said to be Pareto-optimal if it is not dominated by any other τ j. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 15

26 Problem formulation and notations Multiobjective optimization Pareto front Definition τ i is said to be Pareto-optimal if it is not dominated by any other τ j. Convex combination If the Pareto front is convex, any point on the front can be obtained by minimizing a linear combination of the objectives. Theorem f λ (τ = λw(τ + (1 λe(τ, for λ [0, 1]. The Hessian 2 ττ W(τ is diagonally dominant, and thus W is a convex function of τ over the domain (same for E(τ leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 15

27 Problem formulation and notations Multiobjective optimization Pareto front Convex combination If the Pareto front is convex, any point on the front can be obtained by minimizing a linear combination of the objectives. Theorem f λ (τ = λw(τ + (1 λe(τ, for λ [0, 1]. The Hessian 2 ττ W(τ is diagonally dominant, and thus W is a convex function of τ over the domain (same for E(τ τ i (λ = ( c i (λ + (1 λpi c 2 + L µ j τj j=i+1 (, (5 µ i λ + (1 λp a + i 1 (λ + (1 λpj c c j j=1 τ j leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 15

28 Problem formulation and notations Multiobjective optimization Pareto front Theorem The Hessian 2 ττ W(τ is diagonally dominant, and thus W is a convex function of τ over the domain (same for E(τ τ i (λ = ( c i (λ + (1 λpi c 2 + L µ j τj j=i+1 (, (5 µ i λ + (1 λp a + i 1 (λ + (1 λpj c c j j=1 τ j Case one level (L = 1 τ (λ = τ W λ + (1 λp c λ + (1 λp a. (6 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 15

29 Problem formulation and notations Multiobjective optimization Pareto Front (ANL Tradeoffs between Energy and Run Time PMBS SC 14 16

30 Simulation and experimentations 1 Context and motivations 2 Problem formulation and notations Multilevel checkpointing Energy model Multiobjective optimization 3 Simulation and experimentations Experimentations Tradeoff analysis 4 Conclusion and future work leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 17

31 Simulation and experimentations Experimentations Platform Mira 10 PF IBM Blue Gene/Q (BG/Q 49,152 nodes organized in 48 racks 16 cores of 1.6 GHz PowerPC A2 and 16 GB of DDR3 memory. 5-D torus network. Vesta (developmental platform for Mira Same as Mira s but with 2,048 nodes leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 18

32 Simulation and experimentations Experimentations Platform Mira 10 PF IBM Blue Gene/Q (BG/Q 49,152 nodes organized in 48 racks 16 cores of 1.6 GHz PowerPC A2 and 16 GB of DDR3 memory. 5-D torus network. Vesta (developmental platform for Mira Same as Mira s but with 2,048 nodes MonEQ for power measurement (resolution of 560 ms Chip core DRAM Network Collect power data only at the node card level (every 32 nodes The library can not measure the I/O power consumption!! leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 18

33 Simulation and experimentations Experimentations Applications LAMMPS Production-level molecular dynamics application. Lennard-Jones simulation of 1.3 billion atoms. Checkpoint size per node 200 MB ( 100GB application s footprint Checkpoints intervals 4, 8, 16 and 32 minutes. CORAL Qbox: first-principles molecular dynamics AMG: is a parallel algebraic multigrid solver for linear systems arising from problems on unstructured grids. LULESH: performs hydrodynamics stencil calculations minife: is a finite-element code. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 19

34 Simulation and experimentations Experimentations LAMMPS: synchronous Figure : Synchronous multilevel checkpointing 1.3 billion atoms Lennard-Jones simulation. 512 nodes running 64 MPI ranks per node (32,678 proc.. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 20

35 Simulation and experimentations Experimentations LAMMPS: asynchronous Figure : Asynchronous multilevel checkpointing 1.3 billion atoms Lennard-Jones simulation. 512 nodes running 64 MPI ranks per node (32,678 proc.. leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 21

Simulation and experimentations Experimentations CORAL benchmark (a LULESH (b MiniFE (c AMG (d Qbox 32 nodes. leobago@anl.

36 Simulation and experimentations Experimentations CORAL benchmark (a LULESH (b MiniFE (c AMG (d Qbox 32 nodes. Each(ANL application Tradeoffs is run between with Energy a and configuration Run Time PMBS of 16workshop ranks SC 14 22

37 Simulation and experimentations Tradeoff analysis Pareto fronts: Level 4 power consumption (e Level 4 power consumption P c 4 leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 23

38 Simulation and experimentations Tradeoff analysis Pareto fronts :Computation power (f Computation power P a leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 24

39 Simulation and experimentations Tradeoff analysis Pareto fronts : Power ratio Pc P a (g Power ratio Pc P a leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 25

40 Conclusion and future work 1 Context and motivations 2 Problem formulation and notations Multilevel checkpointing Energy model Multiobjective optimization 3 Simulation and experimentations Experimentations Tradeoff analysis 4 Conclusion and future work leobago@anl.gov (ANL Tradeoffs between Energy and Run Time PMBS SC 14 26

41 Conclusion and future work Summary Analytical models of performance and energy for multilevel checkpoint schemes. The pareto-front is obtained using convex combination. Power measurement experiments with production-level scientific applications running on over 32,000 MPI processes: The relative energy overhead of using FTI is minor and thus the tradeoffs are relatively small. Richer tradeoff exist when the power consumption of checkpointing is greater than that of the computation. (ANL Tradeoffs between Energy and Run Time PMBS SC 14 27

42 Conclusion and future work What is Next? Analyzing power profile of different fault tolerance protocols such as full/partial replication and message logging. The viability of replication with respect to the power cap of future exascale platforms (ANL Tradeoffs between Energy and Run Time PMBS SC 14 28

43 Conclusion and future work Questions? Thank You!! (ANL Tradeoffs between Energy and Run Time PMBS SC 14 29

Tradeoff between Reliability and Power Management

Tradeoff between Reliability and Power Management 9/1/2005 FORGE Lee, Kyoungwoo Contents 1. Overview of relationship between reliability and power management 2. Dakai Zhu, Rami Melhem and Daniel Moss e,