A Tighter Analysis of Work Stealing


Marc Tchiboukdjian, Nicolas Gast, Denis Trystram, Jean-Louis Roch, Julien Bernard
Laboratoire d'Informatique de Grenoble / INRIA

Parallel programming with task parallel libraries

    Fib(n) {
        if (n <= 1) return n;
        else {
            x = spawn Fib(n-1);
            y = Fib(n-2);
            sync;
            return x + y;
        }
    }

An online scheduler maps the tasks onto the m processors sharing memory (cores C1, C2, C3, C4 in the figure). For the DAG shown: work W = 17, depth D = 9.

The new standard for parallel programming? Cilk, Intel TBB, Microsoft TPL, KAAPI, ...
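As an aside (not on the slide), the work and depth of such a fork-join program can be computed recursively. The sketch below is ours and counts each Fib call as one unit task, which is a simplifying assumption; the W = 17 and D = 9 above refer to the specific DAG drawn on the slide.

    def work_depth(n):
        """Work (total tasks) and depth (critical path) of the Fib(n) task DAG,
        counting each call as one unit task."""
        if n <= 1:
            return 1, 1
        w1, d1 = work_depth(n - 1)   # spawned, runs in parallel
        w2, d2 = work_depth(n - 2)   # executed by the parent
        # the two branches run in parallel; the sync/addition adds one level
        return w1 + w2 + 1, max(d1, d2) + 1

    print(work_depth(10))   # (177, 10)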

Efficiently scheduling task parallel programs

List scheduling
- Greedy scheduler: while tasks are available, no processor is idle
- Makespan bound: $C_{\max} \le \frac{W}{m} + \left(1 - \frac{1}{m}\right) D$
- Problem: contention on the shared list

Work stealing
- Each processor has its own list
- When its list is empty, a processor tries to steal tasks from the list of a victim chosen uniformly at random
- Contention is reduced: it only occurs when several thieves target the same victim
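As an illustration (not on the slides), here is a minimal Python simulation of this stealing rule for the unit independent tasks studied later in the talk: all W tasks start on processor 0; at each step an idle processor picks a victim uniformly at random and, if it is the only successful thief on that victim, takes half of its tasks; active processors execute one task per step. Function and variable names are ours.

    import random

    def simulate(m, W):
        """Work stealing on W unit independent tasks and m processors.
        Returns (makespan, number of steal attempts)."""
        load = [W] + [0] * (m - 1)            # w_i(0): all work on processor 0
        steps, steal_attempts = 0, 0
        while sum(load) > 0:
            new_load = load[:]
            robbed = set()
            for i in range(m):
                if load[i] == 0:              # idle processor: try to steal
                    victim = random.choice([j for j in range(m) if j != i])
                    steal_attempts += 1
                    if victim not in robbed and load[victim] > 1:
                        robbed.add(victim)    # only one thief succeeds per victim
                        half = load[victim] // 2
                        new_load[i] += half
                        new_load[victim] -= half
            for i in range(m):
                if load[i] > 0:               # active processor: run one unit task
                    new_load[i] -= 1
            load, steps = new_load, steps + 1
        return steps, steal_attempts

    print(simulate(25, 2000))   # makespan slightly above W/m = 80; second value = steal attempts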

Previous work on work stealing
- Probabilistic work generation, focus on steady-state results [Mitzenmacher 98, Berenbrink et al. 03]
- Study of the makespan on identical processors [Blumofe Leiserson 99, Arora Blumofe Plaxton 01]
- Extended to processors with varying speeds [Bender Rabin 02]

Work Stealing Scheduler of Arora, Blumofe and Plaxton
- Unit tasks, one source, out-degree at most 2
- Each processor owns a work queue (deque): the worker pushes and pops at the bottom, a thief steals from the top
- Execute depth-first and steal breadth-first
- Analysis based on the critical path:
  $E[C_{\max}] \le \frac{W}{m} + 32\,D$
  $P\left\{ C_{\max} \ge \frac{W}{m} + 64\,D + 16 \log_2 \frac{1}{\epsilon} \right\} \le \epsilon$
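A sketch (ours, not from the talk) of the deque discipline just described: the owner works depth-first at the bottom of its deque while thieves steal breadth-first from the top. Class and method names are hypothetical, and real runtimes synchronize these operations; the sketch ignores concurrency.

    from collections import deque

    class WorkQueue:
        """Owner pushes/pops at the bottom (LIFO, depth-first);
        thieves steal from the top (FIFO, breadth-first)."""
        def __init__(self):
            self.tasks = deque()

        def push(self, task):      # owner: spawn a new task
            self.tasks.append(task)

        def pop(self):             # owner: take the most recently spawned task
            return self.tasks.pop() if self.tasks else None

        def steal(self):           # thief: take the task closest to the root
            return self.tasks.popleft() if self.tasks else None

    q = WorkQueue()
    q.push("Fib(4)")
    q.push("Fib(3)")
    print(q.pop(), q.steal())      # owner gets Fib(3), thief gets Fib(4)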

Why a new analysis of work stealing?

Analysis of Arora, Blumofe and Plaxton
- DAG with only one source and out-degree at most 2 (does not cover independent tasks)
- Fixed steal policy (steal the task at the top of the deque)
- Big constant factor

New analysis
- Applies to several application models: independent tasks, ABP DAG, unrestricted DAG
- Can model different steal policies: standard steal, cooperative steal
- More accurate

Remainder of the talk
1. Proof methodology
2. Example of unit independent tasks
3. Conclusions

Proof based on load balancing
- Each processor owns some amount of work w_i(t)
- When processor j steals from processor i, part of the work of i is transferred to j (e.g. one half):
  $\max\{w_j(t+1),\, w_i(t+1)\} \le \rho\, w_i(t)$ with $\rho < 1$

Potential Function Φ: motivation
- Gantt chart with 25 processors and 2000 unit tasks (white: execution, grey: steal): it is difficult to see any structure due to the random choices
- Idea: a potential function that decreases at each successful steal
- Bounding the number of steal attempts S bounds C_max:
  $m\, C_{\max} = W + S$
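To make the accounting explicit (an added step, not on the slide): at every time unit, each of the m processors either executes one unit task or makes one steal attempt, so summing over the $C_{\max}$ time units of the schedule gives
\[ m\, C_{\max} = W + S \quad\Longrightarrow\quad C_{\max} = \frac{W}{m} + \frac{S}{m}. \]
For instance, with the Gantt chart's m = 25 and W = 2000, every 25 steal attempts cost exactly one extra time unit over the ideal W/m = 80.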

Potential Function Φ: definition

$\Phi(t) = \sum_{1 \le i \le m} \Big( w_i(t) - \frac{w(t)}{m} \Big)^2$

Φ represents how well the load is balanced between the lists.

Potential Function Φ: properties

$\Phi(t) = \sum_{1 \le i \le m} \Big( w_i(t) - \frac{w(t)}{m} \Big)^2$

1. $\Phi = 0$ ⟹ no more steals
2. Every processor executing the same amount of work c leaves Φ unchanged: $\forall i,\ w_i \to w_i - c$ ⟹ $\Delta\Phi = 0$
3. An idle processor i stealing half of the work of an active processor j gives $\Delta\Phi = \frac{w_j^2}{2}$
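A small numeric check of properties 2 and 3 (our own sketch; it assumes an even load w_j and ignores the task executed during the step):

    def phi(loads):
        """Potential: sum of squared deviations from the mean load."""
        mean = sum(loads) / len(loads)
        return sum((w - mean) ** 2 for w in loads)

    # property 3: an idle processor steals half of the work of processor 1
    w = [0, 64, 0, 0]
    before = phi(w)
    w[0], w[1] = w[1] // 2, w[1] - w[1] // 2
    print(before - phi(w), 64 ** 2 / 2)         # both equal 2048.0

    # property 2: executing the same amount c on every processor leaves Φ unchanged
    v = [10, 7, 3, 12]
    print(phi(v) == phi([x - 2 for x in v]))    # True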

Proof Methodology
1. Compute the expected decrease of the potential in one step, when α_t processors are active and m − α_t are stealing:
   $E[\Phi_t - \Phi_{t+1} \mid \Phi_t] \ge h(\alpha_t)\, \Phi_t$
2. Solve this inequality to bound the number of steal attempts S:
   $E[S] \le \lambda\, m \log_2 \Phi_0$
   $P\left\{ S \ge \lambda\, m \left( \log_2 \Phi_0 + \log_2 \frac{1}{\epsilon} \right) \right\} \le \epsilon$
3. Deduce a bound on the execution time:
   $E[C_{\max}] \le \frac{W}{m} + \lambda \log_2 \Phi_0$ with $\lambda = \max_{1 \le \alpha \le m} \frac{m - \alpha}{-m \log_2(1 - h(\alpha))}$

Example: unit independent tasks

Reminder:
- w_i(t): number of tasks on processor i at time t
- w(t): total number of tasks at time t
- a thief steals half of the tasks of its victim
- if several thieves target the same victim, only one succeeds

First step: expected decrease of the potential.

$\Phi_t = \sum_{1 \le i \le m} \Big( w_i(t) - \frac{w(t)}{m} \Big)^2 = \sum_{1 \le i \le m} w_i^2(t) - \frac{w^2(t)}{m}$

$\Delta\Phi_t = \Phi_t - \Phi_{t+1} = \sum_{i \text{ active}} \delta_i(t) - \frac{1}{m} \big( w^2(t) - w^2(t+1) \big)$

where $\delta_i(t)$ is the decrease of the squared loads attributed to active processor i (including the part received by a possible thief, computed on the next slides), and since the α_t active processors each execute one unit task:

$w^2(t) - w^2(t+1) = w^2(t) - (w(t) - \alpha_t)^2 = 2 \alpha_t w(t) - \alpha_t^2$

Expected decrease of Φ in one step

If processor i is not stolen, one unit of work is executed:
$\delta_i(t) = w_i^2(t) - w_i^2(t+1) = w_i^2(t) - (w_i(t) - 1)^2 = 2 w_i(t) - 1$

If processor j steals half of the work of processor i:
$\delta_i(t) = w_i^2(t) - w_i^2(t+1) - w_j^2(t+1) = w_i^2(t) - \Big( \frac{w_i(t)}{2} - 1 \Big)^2 - \Big( \frac{w_i(t)}{2} \Big)^2 = \frac{w_i^2(t)}{2} + w_i(t) - 1$

Expected decrease of Φ in one step (continued)

Expected decrease on active processor i:
$E[\delta_i(t)] = P\{\text{processor } i \text{ is not stolen}\}\, \big( 2 w_i(t) - 1 \big) + P\{\text{processor } i \text{ is stolen}\}\, \Big( \frac{w_i^2(t)}{2} + w_i(t) - 1 \Big)$

As there are m − α_t idle processors attempting to steal:
$P\{\text{processor } i \text{ is stolen}\} = p(\alpha_t) = 1 - \Big( 1 - \frac{1}{m-1} \Big)^{m - \alpha_t}$

Summing δ_i over all active processors, we get:
$E[\Delta\Phi_t \mid \Phi_t] \ge \frac{p(\alpha_t)}{2}\, \Phi_t$
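A quick Monte Carlo check (ours) of this expression: with m − α thieves each choosing a victim uniformly at random among the other m − 1 processors, the probability that a given active processor is targeted at least once should match p(α).

    import random

    def p_theory(m, alpha):
        # probability that a fixed active processor is chosen by at least one thief
        return 1 - (1 - 1 / (m - 1)) ** (m - alpha)

    def p_empirical(m, alpha, trials=200_000):
        hits = 0
        for _ in range(trials):
            # each of the m - alpha thieves picks the watched processor with prob 1/(m-1)
            if any(random.randrange(m - 1) == 0 for _ in range(m - alpha)):
                hits += 1
        return hits / trials

    m, alpha = 25, 10
    print(p_theory(m, alpha), p_empirical(m, alpha))   # both close to 0.47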

Unit independent tasks: result

Expected decrease of the potential in one step:
$E[\Delta\Phi_t \mid \Phi_t] \ge \frac{p(\alpha_t)}{2}\, \Phi_t$

Solving this inequality bounds the number of steal attempts:
$E[S] \le \lambda\, m \log_2 \Phi_0 + m$ with $\lambda = \frac{1}{1 - \log_2(1 + \frac{1}{e})}$

Bound on the makespan:
$E[C_{\max}] \le \frac{W}{m} + \lambda \log_2 \Phi_0 + 1 \le \frac{W}{m} + 3.65 \log_2 W + 1$

Results from simulation: about 2.37 log₂ W (the gap comes from the adversary choosing α_t in the analysis).

Cooperative Stealing
- Standard steal: if several thieves target the same victim, only one succeeds
- Cooperative steal: all thieves targeting the same victim succeed in stealing some work

If k processors steal from processor i:
$\delta_i(t) = w_i^2(t) - \Big( \frac{w_i(t)}{k+1} - 1 \Big)^2 - k \Big( \frac{w_i(t)}{k+1} \Big)^2 \ge \Big( 1 - \frac{1}{k+1} \Big) w_i^2(t)$

The same analysis leads to:
$E[C^{\text{coop}}_{\max}] \le \frac{W}{m} + \frac{2}{-\log_2(1 - \frac{1}{e})} \log_2 W + 1 \le \frac{W}{m} + 3.02 \log_2 W + 1$

About 20% fewer steals.
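A small numeric illustration (ours) of why cooperative stealing helps when k thieves hit the same victim: under the standard steal only one thief succeeds, so the victim's contribution to Φ drops by roughly w²/2, whereas under the cooperative steal it drops by a factor 1 − 1/(k+1).

    def delta_standard(w):
        # one successful thief takes half, and the victim executes one task
        return w**2 / 2 + w - 1

    def delta_cooperative(w, k):
        # the work is split evenly among the victim and its k thieves,
        # and the victim executes one task
        return w**2 - (w / (k + 1) - 1)**2 - k * (w / (k + 1))**2

    w = 100
    for k in (1, 2, 4, 8):
        print(k, delta_standard(w), round(delta_cooperative(w, k)))
    # for k = 1 the two rules coincide; for larger k the cooperative decrease
    # approaches w**2 while the standard one stays near w**2 / 2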

Conclusion

Work stealing analysis:
- Introduced a new technique based on a potential function
- Accurate
- Can model different steal policies

Not in the paper:
- Improved constant factor for the ABP DAG:
  Arora, Blumofe, Plaxton: $\frac{W}{m} + 32\,D$
  Our analysis: $\frac{W}{m} + 5.5\,D + 1$
- Our analysis also applies to weighted independent tasks and unrestricted DAGs