Scheduling algorithms for heterogeneous and failure-prone platforms


Scheduling algorithms for heterogeneous and failure-prone platforms. Yves Robert, Ecole Normale Supérieure de Lyon & Institut Universitaire de France. Joint work with Olivier Beaumont, Anne Benoit, Henri Casanova, Fanny Dufossé, Mathias Jacquelin, Arnaud Legrand, Loris Marchal, Paul Renaud-Goud, Arnold Rosenberg & Frédéric Vivien. HCW, Anchorage, May 16, 2011.

Evolution of parallel machines: from (good old) parallel architectures...

Evolution of parallel machines: ... to heterogeneous clusters...

Evolution of parallel machines: ... to large-scale grid platforms...

Evolution of parallel machines: ... everything hard-to-program... (courtesy Rajeev Thakur)

Evolution of parallel machines: ... and definitely prone to failure! (courtesy Al Geist)

New platforms, new problems, new solutions. Parallel algorithm design and scheduling were already difficult tasks with homogeneous machines. On heterogeneous, multicore, failure-prone platforms, it gets worse. Need to adapt algorithms and scheduling strategies: new objective functions, new models.

Outline: 1 Who needs a scheduler? 2 Background: scheduling DAGs 3 Steady-state scheduling 4 Energy-aware scheduling 5 Scheduling with replication 6 Checkpointing strategies 7 Conclusion

Who needs a scheduler? Billions of (mostly idle) computers in the world, all interconnected by these (mostly empty) network pipes. Resources: abundant and cheap, not to say unlimited and free. Virtual machines. Greedy resource selection. Demand-driven execution: first-come first-served, round-robin. Nobody needs a scheduler!!! Let's prove this wrong?!

(A) Independent tasks, no communication. B independent equal-size tasks; p processors P_1, P_2, ..., P_p; w_i = time for P_i to process one task. Intuition: the load of P_i should be proportional to its speed 1/w_i. Assign n_i tasks to P_i, with Σ_i n_i = B. Objective: minimize T_exe = max_{1 ≤ i ≤ p} n_i·w_i.

Dynamic programming. With 3 processors: w_1 = 3, w_2 = 5, and w_3 = 8. (The slides build a table task after task, with columns n_1, n_2, n_3, T_exe and the processor selected for each additional task.)
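
A minimal Python sketch (not from the slides) of the rule this table illustrates: each additional task goes to the processor that would complete it earliest, which recovers the optimal assignment for the instance above. The function name and the final instance are illustrative.

```python
def allocate(B, w):
    """Assign B equal-size tasks to processors with cycle times w[i],
    one task at a time, always picking the processor that would finish
    its new task earliest (the rule built by the table above)."""
    p = len(w)
    n = [0] * p                                   # n[i] = tasks given to P_i so far
    for _ in range(B):
        i = min(range(p), key=lambda i: (n[i] + 1) * w[i])
        n[i] += 1
    t_exe = max(n[i] * w[i] for i in range(p))    # T_exe = max_i n_i * w_i
    return n, t_exe

# The 3-processor instance of the slides: w = (3, 5, 8), here with B = 10 tasks
print(allocate(10, [3, 5, 8]))                    # -> ([5, 3, 2], 16)
```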

Static versus dynamic. Greedy (demand-driven) would have done a perfect job. It would even be better (possible variations in processor speeds). Static assignment required useless thinking.

(B) With dependencies, still no communication. A simple finite-difference problem. Iteration space: a 2D rectangle of N_1 × N_2 tiles (the figure highlights tile T_{2,3}). Dependences between neighboring tiles in both dimensions.

Allocation strategy (1/3). Use column-wise allocation to enhance locality. Stepwise execution.

Allocation strategy (2/3). With column-wise allocation, T_opt ≈ N_1·N_2 / (Σ_{i=1}^p 1/w_i). Greedy (demand-driven) allocation: slowdown?! Execution progresses at the pace of the slowest processor.

Allocation strategy (3/3). With 3 processors, w_1 = 3, w_2 = 5, and w_3 = 8: the greedy column-wise allocation gives T_exe ≈ (8/3)·N_1·N_2 ≈ 2.67·N_1·N_2, while T_opt ≈ 1.52·N_1·N_2.

Periodic static allocation (1/2). With 3 processors, w_1 = 3, w_2 = 5, and w_3 = 8: assigning blocks of B = 10 columns, T_exe ≈ 1.6·N_1·N_2.

Periodic static allocation (2/2). Let L = lcm(w_1, w_2, ..., w_p); example: L = lcm(3, 5, 8) = 120. P_1 receives the first n_1 = L/w_1 columns, P_2 the next n_2 = L/w_2 columns, and so on. Period: a block of B = n_1 + n_2 + ... + n_p contiguous columns; example: B = n_1 + n_2 + n_3 = 40 + 24 + 15 = 79. Change the schedule: sort the processors so that n_1·w_1 ≥ n_2·w_2 ≥ ... ≥ n_p·w_p (here all products equal L), and process horizontally within blocks. Optimal.
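
A small sketch of this allocation; the function name is chosen here for illustration (math.lcm with several arguments needs Python 3.9 or later).

```python
from math import lcm

def periodic_allocation(w):
    """Periodic column-wise allocation: P_i receives n_i = L / w_i contiguous
    columns inside every block of B = sum(n_i) columns."""
    L = lcm(*w)
    n = [L // wi for wi in w]
    return L, n, sum(n)

print(periodic_allocation([3, 5, 8]))   # -> (120, [40, 24, 15], 79)
```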

Outline: 1 Who needs a scheduler? 2 Background: scheduling DAGs 3 Steady-state scheduling 4 Energy-aware scheduling 5 Scheduling with replication 6 Checkpointing strategies 7 Conclusion

Traditional scheduling: framework. Application: a DAG G = (T, E, w), where T is the set of tasks, E the dependence constraints, w(T) the computational cost of task T (its execution time), and c(T, T') the communication cost (data sent from T to T'). Platform: a set of p identical processors. Schedule: σ(T) = date at which the execution of task T starts, alloc(T) = processor assigned to T. Objective: the makespan, or total execution time, MS(σ) = max over all tasks T of (σ(T) + w(T)).

Traditional scheduling: constraints. Data dependences: if (T, T') ∈ E, then σ(T) + w(T) ≤ σ(T') when alloc(T) = alloc(T'), and σ(T) + w(T) + c(T, T') ≤ σ(T') when alloc(T) ≠ alloc(T'). Resource constraints: if alloc(T) = alloc(T'), then (σ(T) + w(T) ≤ σ(T')) or (σ(T') + w(T') ≤ σ(T)).
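
As an illustration (not part of the talk), a small checker for these two sets of constraints; the dictionaries and task names below are hypothetical.

```python
def is_valid_schedule(edges, w, c, sigma, alloc):
    """Check the data-dependence and resource constraints above.
    edges = list of (T, Tp) pairs, w[T] = execution time,
    c[(T, Tp)] = communication cost, sigma[T] = start date, alloc[T] = processor."""
    for (t, tp) in edges:                       # data dependences
        delay = 0 if alloc[t] == alloc[tp] else c[(t, tp)]
        if sigma[t] + w[t] + delay > sigma[tp]:
            return False
    tasks = list(w)
    for a in tasks:                             # no overlap on a same processor
        for b in tasks:
            if a != b and alloc[a] == alloc[b]:
                if not (sigma[a] + w[a] <= sigma[b] or sigma[b] + w[b] <= sigma[a]):
                    return False
    return True

print(is_valid_schedule([("T1", "T2")], {"T1": 2, "T2": 1}, {("T1", "T2"): 3},
                        {"T1": 0, "T2": 5}, {"T1": 0, "T2": 1}))   # True
```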

Traditional scheduling: about the model. Simple, but OK for computational resources: no CPU sharing, even in models with preemption; at most one task running per processor at any time-step. Very crude for network resources: unlimited number of simultaneous sends/receives per processor; no contention, i.e. unbounded bandwidth on any link; fully connected interconnection graph (clique). In fact, the model assumes infinite network capacity.

Makespan minimization. NP-hardness: Pb(p) is NP-complete even for independent tasks with no communications (E = ∅, p = 2 and c = 0); Pb(p) is NP-complete for UET-UCT graphs (w = c = 1). Approximation algorithms: without communications, list scheduling is a (2 − 1/p)-approximation; with communications, the result extends to coarse-grain graphs; with communications, there is no λ-approximation in general.

List scheduling, without communications. Initialization: compute the priority level of all tasks; the priority queue is the list of free tasks (tasks without predecessors), sorted by priority. While there remain tasks to execute: add new free tasks, if any, to the queue; if there are q available processors and r tasks in the queue, remove the first min(q, r) tasks from the queue and execute them. Priority level: use the critical path, the longest path from the task to an exit node, computed recursively by a bottom-up traversal of the graph.
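
A minimal sketch of such a list scheduler in Python, assuming no communication costs; the event handling is simplified (one task started per loop iteration) and the task graph at the end is only an example.

```python
import heapq
from functools import lru_cache

def list_schedule(succ, w, p):
    """List scheduling without communication costs.
    succ[t] = successors of task t, w[t] = execution time, p = identical processors.
    Priority = critical path (longest path from t to an exit node)."""
    @lru_cache(maxsize=None)
    def priority(t):
        return w[t] + max((priority(s) for s in succ[t]), default=0)

    npred = {t: 0 for t in succ}
    for t in succ:
        for s in succ[t]:
            npred[s] += 1
    release = {t: 0.0 for t in succ}        # earliest start allowed by dependences
    ready = [(-priority(t), t) for t in succ if npred[t] == 0]
    heapq.heapify(ready)
    proc_free = [0.0] * p                   # time at which each processor becomes idle
    makespan = 0.0
    while ready:
        _, t = heapq.heappop(ready)         # highest-priority free task
        q = min(range(p), key=lambda i: proc_free[i])
        end = max(proc_free[q], release[t]) + w[t]
        proc_free[q] = end
        makespan = max(makespan, end)
        for s in succ[t]:
            release[s] = max(release[s], end)
            npred[s] -= 1
            if npred[s] == 0:
                heapq.heappush(ready, (-priority(s), s))
    return makespan

# Tiny fork-join example: a -> b, a -> c, b -> d, c -> d, on 2 processors
print(list_schedule({"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []},
                    {"a": 1, "b": 2, "c": 3, "d": 1}, 2))   # 5
```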

List scheduling, with communications. Priority level: use the pessimistic critical path, which includes all edge costs in the weight; computed recursively by a bottom-up traversal of the graph. MCP (Modified Critical Path): assign the free task of highest priority to the best processor, where the best processor is the one that finishes the execution first, given the scheduling decisions already taken. Free tasks may not be ready for execution (communication delays). May explore inserting the task into empty slots of the schedule. Complexity: O(|V| log |V| + (|E| + |V|)·p).

Extending the model to heterogeneous resources. Task graph with n tasks T_1, ..., T_n; platform with p heterogeneous processors P_1, ..., P_p. Computation costs: w_iq = execution time of T_i on P_q; w_i = (Σ_{q=1}^p w_iq) / p, the average execution time of T_i; particular case of consistent tasks: w_iq = w_i × γ_q. Communication costs: data(i, j) = data volume of edge e_ij: T_i → T_j; v_qr = communication time for a unit-size message from P_q to P_r (zero if q = r); com(i, j, q, r) = data(i, j) × v_qr, the communication time from T_i executed on P_q to T_j executed on P_r; com_ij = data(i, j) × (Σ_{1 ≤ q, r ≤ p, q ≠ r} v_qr) / (p(p − 1)), the average communication cost of edge e_ij.

HEFT: Heterogeneous Earliest Finish Time. Priority level: rank(T_i) = w_i + max_{T_j ∈ Succ(T_i)} (com_ij + rank(T_j)), where Succ(T) is the set of successors of T; recursive computation by a bottom-up traversal of the graph. Allocation: for the current task T_i, determine the best processor P_q, i.e. minimize σ(T_i) + w_iq; enforce the constraints related to communication costs; insertion scheduling: look for t = σ(T_i) such that P_q is available during the interval [t, t + w_iq[. Complexity: same as MCP, without/with insertion.
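
The upward ranks are easy to compute once the average costs w_i and com_ij are known; a short sketch in which the data structures are assumptions, not from the talk.

```python
from functools import lru_cache

def heft_ranks(succ, w_avg, com_avg):
    """Upward ranks of HEFT: rank(T_i) = w_avg[i] + max over successors j
    of (com_avg[(i, j)] + rank(j)).  succ[i] = successors of task i."""
    @lru_cache(maxsize=None)
    def rank(i):
        return w_avg[i] + max((com_avg[(i, j)] + rank(j) for j in succ[i]),
                              default=0)
    return {i: rank(i) for i in succ}

# HEFT then considers tasks by non-increasing rank and maps each one to the
# processor minimizing its (insertion-based) finish time.
print(heft_ranks({"a": ["b"], "b": []}, {"a": 2.0, "b": 3.0}, {("a", "b"): 1.0}))
```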

What's wrong? Nothing (we still may need to map a DAG onto a platform!). But the communication model is absurd: complicated (many parameters to instantiate) and yet not realistic (clique, no contention). And the metric is wrong: we need to relax the makespan minimization objective.

Favorite communication models. One-port: incoming (and/or) outgoing communications are serialized; pessimistic but realistic; standard MPI. Bounded multi-port: several concurrent incoming and outgoing communications share the total available bandwidth; the bottleneck is the network card; multi-threaded systems.

Outline: 1 Who needs a scheduler? 2 Background: scheduling DAGs 3 Steady-state scheduling 4 Energy-aware scheduling 5 Scheduling with replication 6 Checkpointing strategies 7 Conclusion

Volunteer computing applications. Monte Carlo methods; factoring large numbers; searching for Mersenne primes; particle detection at CERN; many others (BOINC).

Example. (Figure: a tree-shaped platform of processors.) A is the root of the tree; all tasks start at A. Edge labels give the time for sending one task (e.g. from A to B); node labels give the time for computing one task (e.g. in C).

Example (continued). (Gantt chart with one row per activity: A compute, A send, B receive, B compute, C receive, C compute, C send, D receive, D compute.) The schedule consists of a startup phase, a repeated pattern, and a clean-up phase. Steady state: 7 tasks every 6 time units.

Rule of the game. A master M is connected to workers P_1, P_2, ..., P_i, ..., P_p; the link to P_i has cost c_i and P_i has computation cost w_i. The master sends tasks sequentially, without preemption. Full computation/communication overlap for each worker. Worker P_i receives a task in c_i time units and processes a task in w_i time units.

Equations. Worker P_i executes α_i tasks per time unit. Computations: α_i·w_i ≤ 1. One-port communications: Σ_i α_i·c_i ≤ 1. Objective: maximize the throughput ρ = Σ_i α_i.
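
These constraints form a small linear program. A sketch using scipy (an assumption: scipy is available); the platform values are chosen to be consistent with the counts on the example slide further below.

```python
import numpy as np
from scipy.optimize import linprog

def steady_state_lp(c, w):
    """Maximize sum(alpha) subject to alpha_i * w_i <= 1 (computation)
    and sum_i alpha_i * c_i <= 1 (one-port communication), alpha >= 0."""
    p = len(c)
    A_ub = np.vstack([np.diag(w), np.array(c, dtype=float).reshape(1, p)])
    b_ub = np.ones(p + 1)
    res = linprog(-np.ones(p), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * p, method="highs")
    return res.x, -res.fun          # per-worker rates alpha_i, throughput rho

# Three workers with c = (1, 2, 3) and w = (3, 6, 1): throughput 11/18
print(steady_state_lp([1, 2, 3], [3, 6, 1]))
```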

Solution. Serve the faster-communicating workers first: c_1 ≤ c_2 ≤ ... Make full use of the first q workers, where q is the largest index such that Σ_{i=1}^q c_i/w_i ≤ 1. Make partial use of the next worker P_{q+1}. Discard the other workers. Bandwidth-centric strategy: delegate work to the fastest-communicating workers; it does not matter if these workers are computing slowly; slow workers will not contribute much to the overall throughput anyway.
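
The closed-form selection can also be written directly, without the linear program; a sketch in which the function name and the instance are illustrative (the instance is inferred from the counts on the next slide).

```python
def bandwidth_centric(c, w):
    """Sort workers by communication time c_i; enroll them fully while the
    master link is not saturated (sum of c_i/w_i <= 1), use the next worker
    partially, discard the rest.  Returns per-worker rates and throughput."""
    alpha = [0.0] * len(c)
    link = 0.0                                   # fraction of the master link used
    for i in sorted(range(len(c)), key=lambda i: c[i]):
        share = c[i] / w[i]                      # link share if P_i computes full time
        if link + share <= 1.0:
            alpha[i] = 1.0 / w[i]                # fully used worker
            link += share
        else:
            alpha[i] = (1.0 - link) / c[i]       # partially used worker
            break
    return alpha, sum(alpha)

print(bandwidth_centric([1, 2, 3], [3, 6, 1]))   # throughput 11/18, as on the next slide
```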

Example. Fully active workers versus discarded workers. 6 tasks to P_1: communication 6·c_1 = 6, computation 6·w_1 = 18. 3 tasks to P_2: 3·c_2 = 6, 3·w_2 = 18. 2 tasks to P_3: 2·c_3 = 6, 2·w_3 = 2. Altogether, 11 tasks every 18 time units (ρ = 11/18 ≈ 0.61).

Example (continued). Compare to a purely greedy (demand-driven) strategy: 5 tasks every 36 time units (ρ = 5/36 ≈ 0.14). Even if resources are cheap and abundant, resource selection is key to performance.

Extension to trees. Fully used nodes, partially used nodes, idle nodes. Resource selection is based on local information (the children of each node).

Does this really work? Can we deal with arbitrary platforms (including cycles)? Yes. Can we deal with return messages? Yes. In fact, can we deal with more complex applications (arbitrary collections of DAGs)? Yes, I mean, almost!

The LP formulation still works well... For every processor P_i and every edge e_mn: T_m → T_n, the conservation law reads Σ_j sent(P_j → P_i, e_mn) + executed(P_i, T_m) = executed(P_i, T_n) + Σ_k sent(P_i → P_k, e_mn). Computations: Σ_m executed(P_i, T_m) × flops(T_m) × w_i ≤ 1. Outgoing communications: Σ_{m,n} Σ_j sent(P_i → P_j, e_mn) × bytes(e_mn) × c_ij ≤ 1.

... but schedule reconstruction is harder. (Figure: the reconstructed cyclic pattern over processors P_1 to P_4 and the links between them, for applications A_1 to A_5.) The actual (cyclic) schedule is obtained in polynomial time. Asymptotic optimality. A couple of practical problems (large period, number of buffers). No local scheduling policy.

The beauty of steady-state scheduling. Rationale: maximize the throughput. Simplicity: relaxation of the makespan minimization problem; ignore initialization and clean-up; no need for a precise ordering of tasks/messages; characterize resource activity per time unit: the fraction of time spent computing for each application, and the fraction of time spent receiving from / sending to each neighbor. Efficiency: optimal throughput ⇒ optimal schedule (up to a constant number of tasks); periodic schedule, described in compact form: compiling a loop instead of a DAG!

Lesson learnt? Resource selection is mandatory.

Outline: 1 Who needs a scheduler? 2 Background: scheduling DAGs 3 Steady-state scheduling 4 Energy-aware scheduling 5 Scheduling with replication 6 Checkpointing strategies 7 Conclusion

Scheduling workflows. Data streams: images, frames, matrices, etc. Structured applications: several stages to process each data set (original images, pre-processing, encoding, final result). Goal: efficiently use the computing resources.

Example. 4 processing stages S_1, ..., S_4, 3 processors P_1, P_2, P_3 at our disposal: where and how can we execute the application? First option: use all resources greedily; many communications to pay, not efficient at all! Second option: everything on the fastest processor; no communication, and the optimal execution time to process one single data set. Third option: optimal throughput, processing different data sets in parallel; resource selection: do not use the slowest processor.

Optimization criteria. Period P: time interval between the beginnings of execution of two consecutive data sets (inverse of the throughput). Latency L: maximal time elapsed between the beginning and the end of execution of a data set. Reliability: inverse of F, the probability of failure of the application (i.e., some data sets will not be processed). Energy E: total power dissipated by the platform ("the internet begins with coal").

A primer on energy consumption. Algorithmic techniques: shut down idle processors; dynamic speed scaling (processors can run at variable speed, e.g., Intel XScale, Intel SpeedStep, AMD PowerNow); VDD hopping (from discrete modes to continuous speeds). Dissipated power: the higher the speed, the higher the power consumption; Power ∝ f·V², and V (voltage) increases with f (frequency); at speed s, P(s) = s^α + P_static, with 2 ≤ α ≤ 3.
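
A tiny worked illustration of this model, assuming α = 3 and no static power; under that assumption, two stages running at speeds 8 and 3 would dissipate 8³ + 3³ = 539, which matches the value P = 539 appearing in the example below (an inference, not stated on the slides).

```python
def power(s, alpha=3.0, p_static=0.0):
    """Dissipated power P(s) = s**alpha + P_static (here alpha = 3, no static part)."""
    return s ** alpha + p_static

# Running twice as fast costs eight times the power when alpha = 3:
print(power(1.0), power(2.0))          # 1.0  8.0
print(power(8.0) + power(3.0))         # 539.0, cf. the example below
```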

Example. A first mapping: power P = 539, period T = 3, latency L = 8. A second mapping: P = 8 only, but T = 15 and L = 17.

Multi-criteria optimization. Performance-oriented objectives: minimize the period P (inverse of the throughput); minimize the latency L (time to process a data set); minimize the application failure probability. Environmental objectives: minimize the energy consumption E; minimize the platform cost. Which objective function? Minimize α·P + β·L + γ·E? Rather, minimize the period, given a response-time bound and an energy budget!

Research problems. Single workflow: performance (period/latency) vs. environment (energy, cost); robust mappings (replication for performance vs. replication for reliability). Several (concurrent) workflows: competition for CPU and network resources; fairness between applications: max-min throughput, min-max stretch (slowdown factor); sensitivity to application/platform parameter changes.

Outline: 1 Who needs a scheduler? 2 Background: scheduling DAGs 3 Steady-state scheduling 4 Energy-aware scheduling 5 Scheduling with replication 6 Checkpointing strategies 7 Conclusion

Cycle-stealing scenario (1/2). Execute 4 jobs A, B, C, D during the week-end. Replicate them on 3 machines P_1, P_2 and P_3. The risk increases linearly with time. Machines are reclaimed at 8am on Monday with probability 1.

Cycle-stealing scenario (2/2). (Figure: the four jobs A, B, C, D replicated on the three machines P_1, P_2, P_3.)

Problem. n chunks (a divisible computational workload), p identical computers. Replicate the execution of each chunk p times; the chunks are partitioned into groups (Group 1, Group 2, Group 3 in the example). Unrecoverable (fail-stop) interruptions. A priori knowledge of the risk (failure probability). Goal: maximize the expected amount of work done.

Symmetric schedules (1/2). (Figure: the time-steps at which machines P_1, ..., P_4 execute the chunks of each group.)

Symmetric schedules (2/2). (Figure: time-steps of the group executions on P_1, ..., P_4.) G_ij = time-step of the i-th execution of group j.

Optimization problem (1/2). All four executions of a group must fail for its work to be lost, and this happens with probability proportional to the product of the corresponding time-steps. For group 1: the first execution fails with probability λωG_11, with G_11 = 1; the second with probability λωG_21, with G_21 = 6; the third with probability λωG_31, with G_31 = 9; the fourth with probability λωG_41, with G_41 = 12. Hence group 1 is entirely lost with probability λ⁴ω⁴ Π_{i=1}^4 G_i1.

Optimization problem (2/2). E(work done) = Σ_{j=1}^{n/p} p·ω·(1 − λ^p ω^p Π_{i=1}^p G_{i,j}). Hence, minimize K = Σ_{j=1}^{n/p} Π_{i=1}^p G_{i,j}, where the entries of G form a permutation of [1..n]. Lower bound (by the arithmetic-geometric mean inequality): K_min = (n/p) · (n!)^{p/n}.
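
A short sketch of the two quantities, with names chosen here for illustration; for the 12-chunk, 4-processor instance of the next slides, the bound evaluates to about 2347, which the slides round to 2348.

```python
from math import factorial, prod

def schedule_cost(G):
    """K = sum over the groups (columns of G) of the product of their time-steps."""
    return sum(prod(column) for column in zip(*G))

def lower_bound(n, p):
    """AM-GM bound: K >= (n/p) * (n!) ** (p/n)."""
    return (n / p) * factorial(n) ** (p / n)

print(lower_bound(12, 4))   # ~2347.3, rounded to 2348 on the slides
```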

Heuristics (1/3). (a) Cyclic: K = 3104. (b) Reverse: K = 2368.

Heuristics (2/3). (c) Mirror: K = 2572. (d) Snake: K = 2464.

Heuristics (3/3). (e) Greedy: K = 2368. (f) Optimal: K = 2352, to be compared with K_min = 3·(12!)^{1/3} ≈ 2348.

A nice little algorithmic challenge. (Matrix with rows P_1, ..., P_p and columns Group 1, ..., Group n/p.) Fill the matrix with a permutation of [1..n] minimizing the sum of the column products. OK, OK, Greedy is asymptotically optimal. Still, this is a frustrating open problem.

Many potential sources of heterogeneity. Processors of different speeds; links of different bandwidths; resources obeying different risk functions (different owner categories?). Trade-off between speed and reliability.

Mapping applications on volatile resources. Iterative applications: independent / tightly-coupled tasks within each iteration; synchronization (checkpoint) after each iteration. Master-worker paradigm: heterogeneous processors; limited available bandwidth from the master to the workers. Volatile desktop grids: resource availability UP / RECLAIMED / DOWN, modeled as a Markov process (with different transition probabilities). Goal: on-line policies for resource selection. Which resources to enroll: fast or reliable ones? When to change the configuration?

Outline: 1 Who needs a scheduler? 2 Background: scheduling DAGs 3 Steady-state scheduling 4 Energy-aware scheduling 5 Scheduling with replication 6 Checkpointing strategies 7 Conclusion

Motivation. Framework: a very, very large number of processing elements (e.g., 2^20); a failure-prone platform (like any realistic platform); a large application to be executed ⇒ failures will indeed occur before completion! Questions: when to checkpoint the application? Should we always use all processors?

Notations. Overall size of the work: W. Checkpoint cost: C. Downtime: D (software rejuvenation via rebooting, or replacement by a spare). Recovery cost after a failure: R.

Makespan vs. NextFailure. Makespan: minimize the job's expected makespan. NextFailure: maximize the expected amount of work completed before the next failure. Rationale: NextFailure is an optimization on a failure-by-failure basis, hopefully a good approximation, at least for large job sizes W.

Makespan. T(0 | τ) = 0, and T(ω | τ) = ω_1 + C + T(ω − ω_1 | τ + ω_1 + C) if the processor does not fail during the next ω_1 + C units of time, T_wasted(ω_1 + C | τ) + T(ω | R) otherwise. Makespan: find ω_1 = f(ω | τ) ≤ ω that minimizes E(T(W | 0)).

NextFailure. W(0 | τ) = 0, and W(ω | τ) = ω_1 + W(ω − ω_1 | τ + ω_1 + C) if the processor does not fail during the next ω_1 + C units of time, 0 otherwise. NextFailure: find ω_1 = f(ω | τ) ≤ ω that maximizes E(W(W | 0)).
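
A discretized sketch of the NextFailure recursion, under the simplifying assumption of exponential failures (memoryless, so the elapsed time τ drops out of the expectation); the grid step, names, and toy instance are illustrative.

```python
import math
from functools import lru_cache

def next_failure_dp(W, C, lam, step=1.0):
    """Expected work completed before the next failure, maximizing over the
    size omega_1 of the next chunk on a grid of granularity `step`:
    E[W(w)] = max over omega_1 of exp(-lam*(omega_1 + C)) * (omega_1 + E[W(w - omega_1)])."""
    n = int(round(W / step))
    @lru_cache(maxsize=None)
    def best(k):                       # k = remaining work, in units of `step`
        if k == 0:
            return 0.0
        return max(math.exp(-lam * (j * step + C)) * (j * step + best(k - j))
                   for j in range(1, k + 1))
    return best(n)

# Toy instance: 100 units of work, checkpoint cost 1, failure rate 0.01 per time unit
print(next_failure_dp(100, 1, 0.01, step=5.0))
```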

Single-processor jobs. Makespan: for exponential failures, F(t) = 1 − e^{−λt}, the periodic strategy is optimal; for other laws (e.g. Weibull, F(t) = 1 − e^{−(t/λ)^k}), the problem is open. NextFailure: open for all failure laws; accurate dynamic programming approximations.

Parallel jobs. Models: embarrassingly parallel jobs, W(p) = W/p; Amdahl parallel jobs, W(p) = W/p + γW; numerical kernels, W(p) = W/p + γW^{2/3}/√p. Accurate dynamic programming approximations.
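
The three work models in a few lines, with names chosen here for illustration.

```python
def parallel_work(W, p, model="amdahl", gamma=0.0):
    """Work per processor W(p) for the three job models above."""
    if model == "parallel":            # embarrassingly parallel
        return W / p
    if model == "amdahl":              # sequential fraction gamma
        return W / p + gamma * W
    if model == "kernel":              # numerical kernels
        return W / p + gamma * W ** (2 / 3) / p ** 0.5
    raise ValueError(model)

print(parallel_work(1e6, 1024, "amdahl", gamma=1e-3))   # 1976.5625
```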

Parallel jobs (continued). (Figure: loss(p_used) as a function of p_used, the number of processors used (log scale), for jobs obeying Amdahl's law on Jaguar, with γ = 0.01 or 0.001 and MTBF = 26.5, 53 or 106 years.)

Problems. Software/hardware techniques to reduce checkpoint, recovery and migration times, and to improve failure prediction. Self-fault-tolerant algorithms (e.g. asynchronous iterative methods). Multi-criteria scheduling problems: throughput/energy/reliability. Add replication and group execution. We need to combine all these approaches!

Outline: 1 Who needs a scheduler? 2 Background: scheduling DAGs 3 Steady-state scheduling 4 Energy-aware scheduling 5 Scheduling with replication 6 Checkpointing strategies 7 Conclusion

Tools for the road. Forget absolute makespan minimization. Resource selection is mandatory. Divisible load (fractional tasks). Single application: period / latency / energy / robustness. Several applications: max-min fairness, min-max stretch. Linear programming: an absolute bound to assess heuristics. Probabilities and stochastic models... unavoidable.

Scheduling for large-scale platforms. If the platform is well identified and relatively stable, try to (i) accurately model its hierarchical structure and (ii) design efficient/robust/energy-aware scheduling algorithms. If the platform is not stable enough, or if it evolves too fast, dynamic schedulers are the only option. Otherwise, grab any opportunity to inject static knowledge into dynamic schedulers. Is this opportunity a niche, or does it encompass a wide range of applications? The future is on our side!


More information

Scheduling multiple bags of tasks on heterogeneous master-worker platforms: centralized versus distributed solutions

Scheduling multiple bags of tasks on heterogeneous master-worker platforms: centralized versus distributed solutions Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 Scheduling multiple bags of tasks on heterogeneous master-worker

More information

Overlay networks maximizing throughput

Overlay networks maximizing throughput Overlay networks maximizing throughput Olivier Beaumont, Lionel Eyraud-Dubois, Shailesh Kumar Agrawal Cepage team, LaBRI, Bordeaux, France IPDPS April 20, 2010 Outline Introduction 1 Introduction 2 Complexity

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Embedded Systems Design: Optimization Challenges. Paul Pop Embedded Systems Lab (ESLAB) Linköping University, Sweden

Embedded Systems Design: Optimization Challenges. Paul Pop Embedded Systems Lab (ESLAB) Linköping University, Sweden of /4 4 Embedded Systems Design: Optimization Challenges Paul Pop Embedded Systems Lab (ESLAB) Linköping University, Sweden Outline! Embedded systems " Example area: automotive electronics " Embedded systems

More information

CPU scheduling. CPU Scheduling

CPU scheduling. CPU Scheduling EECS 3221 Operating System Fundamentals No.4 CPU scheduling Prof. Hui Jiang Dept of Electrical Engineering and Computer Science, York University CPU Scheduling CPU scheduling is the basis of multiprogramming

More information

CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (con td) and Matrix Multiplication

CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (con td) and Matrix Multiplication 1 / 26 CEE 618 Scientific Parallel Computing (Lecture 7): OpenMP (con td) and Matrix Multiplication Albert S. Kim Department of Civil and Environmental Engineering University of Hawai i at Manoa 2540 Dole

More information

Lecture 6. Real-Time Systems. Dynamic Priority Scheduling

Lecture 6. Real-Time Systems. Dynamic Priority Scheduling Real-Time Systems Lecture 6 Dynamic Priority Scheduling Online scheduling with dynamic priorities: Earliest Deadline First scheduling CPU utilization bound Optimality and comparison with RM: Schedulability

More information

Scheduling communication requests traversing a switch: complexity and algorithms

Scheduling communication requests traversing a switch: complexity and algorithms Scheduling communication requests traversing a switch: complexity and algorithms Matthieu Gallet Yves Robert Frédéric Vivien Laboratoire de l'informatique du Parallélisme UMR CNRS-ENS Lyon-INRIA-UCBL 5668

More information

Non-Work-Conserving Non-Preemptive Scheduling: Motivations, Challenges, and Potential Solutions

Non-Work-Conserving Non-Preemptive Scheduling: Motivations, Challenges, and Potential Solutions Non-Work-Conserving Non-Preemptive Scheduling: Motivations, Challenges, and Potential Solutions Mitra Nasri Chair of Real-time Systems, Technische Universität Kaiserslautern, Germany nasri@eit.uni-kl.de

More information

Revisiting Matrix Product on Master-Worker Platforms

Revisiting Matrix Product on Master-Worker Platforms Revisiting Matrix Product on Master-Worker Platforms Jack Dongarra 2, Jean-François Pineau 1, Yves Robert 1,ZhiaoShi 2,andFrédéric Vivien 1 1 LIP, NRS-ENS Lyon-INRIA-UBL 2 Innovative omputing Laboratory

More information

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano

Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano ... Our contribution PIPS-PSBB*: Multi-level parallelism for Stochastic

More information

Stochastic Optimization for Undergraduate Computer Science Students

Stochastic Optimization for Undergraduate Computer Science Students Stochastic Optimization for Undergraduate Computer Science Students Professor Joongheon Kim School of Computer Science and Engineering, Chung-Ang University, Seoul, Republic of Korea 1 Reference 2 Outline

More information

1 Markov decision processes

1 Markov decision processes 2.997 Decision-Making in Large-Scale Systems February 4 MI, Spring 2004 Handout #1 Lecture Note 1 1 Markov decision processes In this class we will study discrete-time stochastic systems. We can describe

More information

A 2-Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value

A 2-Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value A -Approximation Algorithm for Scheduling Parallel and Time-Sensitive Applications to Maximize Total Accrued Utility Value Shuhui Li, Miao Song, Peng-Jun Wan, Shangping Ren Department of Engineering Mechanics,

More information

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers

Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Maximizing throughput in zero-buffer tandem lines with dedicated and flexible servers Mohammad H. Yarmand and Douglas G. Down Department of Computing and Software, McMaster University, Hamilton, ON, L8S

More information

Scheduling I. Today. Next Time. ! Introduction to scheduling! Classical algorithms. ! Advanced topics on scheduling

Scheduling I. Today. Next Time. ! Introduction to scheduling! Classical algorithms. ! Advanced topics on scheduling Scheduling I Today! Introduction to scheduling! Classical algorithms Next Time! Advanced topics on scheduling Scheduling out there! You are the manager of a supermarket (ok, things don t always turn out

More information

Embedded Systems 15. REVIEW: Aperiodic scheduling. C i J i 0 a i s i f i d i

Embedded Systems 15. REVIEW: Aperiodic scheduling. C i J i 0 a i s i f i d i Embedded Systems 15-1 - REVIEW: Aperiodic scheduling C i J i 0 a i s i f i d i Given: A set of non-periodic tasks {J 1,, J n } with arrival times a i, deadlines d i, computation times C i precedence constraints

More information

Average-Case Performance Analysis of Online Non-clairvoyant Scheduling of Parallel Tasks with Precedence Constraints

Average-Case Performance Analysis of Online Non-clairvoyant Scheduling of Parallel Tasks with Precedence Constraints Average-Case Performance Analysis of Online Non-clairvoyant Scheduling of Parallel Tasks with Precedence Constraints Keqin Li Department of Computer Science State University of New York New Paltz, New

More information

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors

Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors Parallelization of Multilevel Preconditioners Constructed from Inverse-Based ILUs on Shared-Memory Multiprocessors J.I. Aliaga 1 M. Bollhöfer 2 A.F. Martín 1 E.S. Quintana-Ortí 1 1 Deparment of Computer

More information

Parallel Performance Theory

Parallel Performance Theory AMS 250: An Introduction to High Performance Computing Parallel Performance Theory Shawfeng Dong shaw@ucsc.edu (831) 502-7743 Applied Mathematics & Statistics University of California, Santa Cruz Outline

More information

Lecture 2: Metrics to Evaluate Systems

Lecture 2: Metrics to Evaluate Systems Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video

More information

Periodic I/O Scheduling for Supercomputers

Periodic I/O Scheduling for Supercomputers Periodic I/O Scheduling for Supercomputers Guillaume Aupy 1, Ana Gainaru 2, Valentin Le Fèvre 3 1 Inria & U. of Bordeaux 2 Vanderbilt University 3 ENS Lyon & Inria PMBS Workshop, November 217 Slides available

More information

Modeling Fixed Priority Non-Preemptive Scheduling with Real-Time Calculus

Modeling Fixed Priority Non-Preemptive Scheduling with Real-Time Calculus Modeling Fixed Priority Non-Preemptive Scheduling with Real-Time Calculus Devesh B. Chokshi and Purandar Bhaduri Department of Computer Science and Engineering Indian Institute of Technology Guwahati,

More information

End-to-End Scheduling to Meet Deadlines in Distributed Systems

End-to-End Scheduling to Meet Deadlines in Distributed Systems Appeared in: Proceedings of the 12th International Conference on Distributed Computing Systems, Yokohama, Japan, June 1992, pages 452--459. End-to-End Scheduling to Meet Deadlines in Distributed Systems

More information

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco

Parallel programming using MPI. Analysis and optimization. Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Parallel programming using MPI Analysis and optimization Bhupender Thakur, Jim Lupo, Le Yan, Alex Pacheco Outline l Parallel programming: Basic definitions l Choosing right algorithms: Optimal serial and

More information

Power-Aware Linear Programming Based Scheduling for Heterogeneous Computer Clusters

Power-Aware Linear Programming Based Scheduling for Heterogeneous Computer Clusters Power-Aware Linear Programming Based Scheduling for Heterogeneous Computer Clusters Hadil Al-Daoud Department of Computing and Software McMaster University Email: aldaouhb@mcmasterca Issam Al-Azzoni INRIA

More information

Data redistribution algorithms for homogeneous and heterogeneous processor rings

Data redistribution algorithms for homogeneous and heterogeneous processor rings Data redistribution algorithms for homogeneous and heterogeneous processor rings Hélène Renard, Yves Robert and Frédéric Vivien LIP, UMR CNRS-INRIA-UCBL 5668 ENS Lyon, France {Helene.Renard Yves.Robert

More information

Flow Shop and Job Shop Models

Flow Shop and Job Shop Models Outline DM87 SCHEDULING, TIMETABLING AND ROUTING Lecture 11 Flow Shop and Job Shop Models 1. Flow Shop 2. Job Shop Marco Chiarandini DM87 Scheduling, Timetabling and Routing 2 Outline Resume Permutation

More information

FOR decades it has been realized that many algorithms

FOR decades it has been realized that many algorithms IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS 1 Scheduling Divisible Loads with Nonlinear Communication Time Kai Wang, and Thomas G. Robertazzi, Fellow, IEEE Abstract A scheduling model for single

More information

Module 5: CPU Scheduling

Module 5: CPU Scheduling Module 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 5.1 Basic Concepts Maximum CPU utilization obtained

More information

Steady-state scheduling of task graphs on heterogeneous computing platforms

Steady-state scheduling of task graphs on heterogeneous computing platforms Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON n o 5668 Steady-state scheduling of task graphs on heterogeneous computing platforms

More information

Lecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013

Lecture 13. Real-Time Scheduling. Daniel Kästner AbsInt GmbH 2013 Lecture 3 Real-Time Scheduling Daniel Kästner AbsInt GmbH 203 Model-based Software Development 2 SCADE Suite Application Model in SCADE (data flow + SSM) System Model (tasks, interrupts, buses, ) SymTA/S

More information

Online Scheduling Switch for Maintaining Data Freshness in Flexible Real-Time Systems

Online Scheduling Switch for Maintaining Data Freshness in Flexible Real-Time Systems Online Scheduling Switch for Maintaining Data Freshness in Flexible Real-Time Systems Song Han 1 Deji Chen 2 Ming Xiong 3 Aloysius K. Mok 1 1 The University of Texas at Austin 2 Emerson Process Management

More information

Spatio-Temporal Thermal-Aware Scheduling for Homogeneous High-Performance Computing Datacenters

Spatio-Temporal Thermal-Aware Scheduling for Homogeneous High-Performance Computing Datacenters Spatio-Temporal Thermal-Aware Scheduling for Homogeneous High-Performance Computing Datacenters Hongyang Sun a,, Patricia Stolf b, Jean-Marc Pierson b a Ecole Normale Superieure de Lyon & INRIA, France

More information

Analytical Modeling of Parallel Systems

Analytical Modeling of Parallel Systems Analytical Modeling of Parallel Systems Chieh-Sen (Jason) Huang Department of Applied Mathematics National Sun Yat-sen University Thank Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar for providing

More information

Chapter 6: CPU Scheduling

Chapter 6: CPU Scheduling Chapter 6: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation 6.1 Basic Concepts Maximum CPU utilization obtained

More information

Parallel Performance Theory - 1

Parallel Performance Theory - 1 Parallel Performance Theory - 1 Parallel Computing CIS 410/510 Department of Computer and Information Science Outline q Performance scalability q Analytical performance measures q Amdahl s law and Gustafson-Barsis

More information

arxiv: v1 [cs.dc] 10 Feb 2014

arxiv: v1 [cs.dc] 10 Feb 2014 We Are Impatient: Algorithms for Geographically Distributed Load Balancing with (Almost) Arbitrary Load Functions arxiv:1402.2090v1 [cs.dc] 10 Feb 2014 Piotr Skowron Institute of Informatics University

More information

Saving Energy in Sparse and Dense Linear Algebra Computations

Saving Energy in Sparse and Dense Linear Algebra Computations Saving Energy in Sparse and Dense Linear Algebra Computations P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí, V. Roca Univ. Politécnica Univ. Jaume I The Univ. of Texas de Valencia, Spain

More information

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units

Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern

More information

Dense Arithmetic over Finite Fields with CUMODP

Dense Arithmetic over Finite Fields with CUMODP Dense Arithmetic over Finite Fields with CUMODP Sardar Anisul Haque 1 Xin Li 2 Farnam Mansouri 1 Marc Moreno Maza 1 Wei Pan 3 Ning Xie 1 1 University of Western Ontario, Canada 2 Universidad Carlos III,

More information

Failure Tolerance of Multicore Real-Time Systems scheduled by a Pfair Algorithm

Failure Tolerance of Multicore Real-Time Systems scheduled by a Pfair Algorithm Failure Tolerance of Multicore Real-Time Systems scheduled by a Pfair Algorithm Yves MOUAFO Supervisors A. CHOQUET-GENIET, G. LARGETEAU-SKAPIN OUTLINES 2 1. Context and Problematic 2. State of the art

More information

Lecture 2: Scheduling on Parallel Machines

Lecture 2: Scheduling on Parallel Machines Lecture 2: Scheduling on Parallel Machines Loris Marchal October 17, 2012 Parallel environment alpha in Graham s notation): P parallel identical Q uniform machines: each machine has a given speed speed

More information

416 Distributed Systems

416 Distributed Systems 416 Distributed Systems RAID, Feb 26 2018 Thanks to Greg Ganger and Remzi Arapaci-Dusseau for slides Outline Using multiple disks Why have multiple disks? problem and approaches RAID levels and performance

More information

CHAPTER 5 - PROCESS SCHEDULING

CHAPTER 5 - PROCESS SCHEDULING CHAPTER 5 - PROCESS SCHEDULING OBJECTIVES To introduce CPU scheduling, which is the basis for multiprogrammed operating systems To describe various CPU-scheduling algorithms To discuss evaluation criteria

More information

Aperiodic Task Scheduling

Aperiodic Task Scheduling Aperiodic Task Scheduling Jian-Jia Chen (slides are based on Peter Marwedel) TU Dortmund, Informatik 12 Germany Springer, 2010 2017 年 11 月 29 日 These slides use Microsoft clip arts. Microsoft copyright

More information

P, NP, NP-Complete, and NPhard

P, NP, NP-Complete, and NPhard P, NP, NP-Complete, and NPhard Problems Zhenjiang Li 21/09/2011 Outline Algorithm time complicity P and NP problems NP-Complete and NP-Hard problems Algorithm time complicity Outline What is this course

More information

Divisible load theory

Divisible load theory Divisible load theory Loris Marchal November 5, 2012 1 The context Context of the study Scientific computing : large needs in computation or storage resources Need to use systems with several processors

More information

1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007

1 Introduction. 1.1 The Problem Domain. Self-Stablization UC Davis Earl Barr. Lecture 1 Introduction Winter 2007 Lecture 1 Introduction 1 Introduction 1.1 The Problem Domain Today, we are going to ask whether a system can recover from perturbation. Consider a children s top: If it is perfectly vertically, you can

More information

Assessing the impact and limits of steady-state scheduling for mixed task and data parallelism on heterogeneous platforms

Assessing the impact and limits of steady-state scheduling for mixed task and data parallelism on heterogeneous platforms Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 Assessing the impact and limits of steady-state scheduling for

More information

Real-time Scheduling of Periodic Tasks (1) Advanced Operating Systems Lecture 2

Real-time Scheduling of Periodic Tasks (1) Advanced Operating Systems Lecture 2 Real-time Scheduling of Periodic Tasks (1) Advanced Operating Systems Lecture 2 Lecture Outline Scheduling periodic tasks The rate monotonic algorithm Definition Non-optimality Time-demand analysis...!2

More information