Optimizing Performance and Reliability on Heterogeneous Parallel Systems: Approximation Algorithms and Heuristics
Emmanuel Jeannot (a), Erik Saule (b), Denis Trystram (c)

(a) INRIA Bordeaux Sud-Ouest, Talence, France
(b) BMI, Ohio State University, Columbus, OH 43210, USA
(c) Grenoble Institute of Technology, Grenoble, France

Abstract

We study the problem of scheduling tasks (with and without precedence constraints) on a set of related processors which have a probability of failure governed by an exponential law. The goal is to design approximation algorithms or heuristics that optimize both makespan and reliability. First, we show that both objectives are contradictory and that the number of points of the Pareto-front can be exponential. This means that this problem cannot be approximated by a single schedule. Second, for independent unitary tasks, we provide an optimal scheduling algorithm where the objective is to maximize the reliability subject to makespan minimization. For the bi-objective optimization, we provide a (1+ɛ, 1)-approximation algorithm of the Pareto-front. Next, for independent arbitrary tasks, we propose a ⟨2, 1⟩-approximation algorithm (i.e., for any fixed value of the makespan, the obtained solution is optimal on the reliability and no more than twice the given makespan) that has a much lower complexity than the other existing algorithms. This solution is used to derive a (2+ɛ, 1)-approximation of the Pareto-front of the problem. All these proposed solutions are discriminated by the value of the product {failure rate} × {unitary instruction execution time} of each processor, which appears to be a crucial parameter in the context of bi-objective optimization. Based on this observation, we provide a general method for converting scheduling heuristics on heterogeneous clusters into heuristics that take the reliability into account when there are precedence constraints. The average behaviour is studied by extensive simulations.
Finally, we discuss the specific case of scheduling a chain of tasks, which leads to optimal results.

Preprint submitted to J. of Parallel and Dist. Computing, November 22, 2011
Keywords: Scheduling, Pareto-front approximation, Reliability, Makespan, Precedence Task Graphs

1. Introduction

With the recent development of large parallel and distributed systems (computational grids, clusters of clusters, peer-to-peer networks, etc.), it is difficult to ensure that the resources are always available for a long period of time. Indeed, hardware failures, software faults, power breakdowns or resource removals often occur when using a very large number of machines. Hence, in this context, taking into account new objectives dealing with fault-tolerance is a major issue. Several approaches have been proposed to tackle the problem of faults. One possible approach is based on duplication: if one resource fails, other resources can continue to correctly execute the redundant parts of the application. However, the main drawback of this approach is a possible waste of resources. An alternative solution consists in check-pointing the computation from time to time and, in case of failure, restarting it from the last check-point [1, 2]. However, check-pointing an application is costly and may require modifying it. Furthermore, restarting an application slows it down. Therefore, in order to minimize the cost of the check-point/restart mechanism, it is necessary to provide a reliable execution that minimizes the probability of failure of the application. Scheduling an application consists in determining which resources will execute the tasks and when they will start. Thus, the scheduling algorithm is responsible for minimizing the probability of failure of the application by choosing an adequate set of resources that enables a fast and reliable execution. Unfortunately, as we will show in this paper, increasing the reliability implies, most of the time, an increase of the execution time (a fast schedule is not necessarily a reliable one).
This motivates the design of algorithms that compute a set of trade-off (compromise) solutions. In this paper, we study the problem of scheduling an application, represented by a precedence task graph or by a set of independent tasks, on heterogeneous computing resources. The objectives are to minimize the makespan and to maximize the reliability of the schedule.
In the literature, this problem has mainly been studied from a practical point of view [3, 4, 5]. It lacks well-founded theoretical analysis. Some unanswered questions are the following: Is maximizing the reliability a difficult (NP-hard) problem? Is it possible to find polynomial solutions of the bi-objective problem for special kinds of precedence task graphs? Is it possible to approximate the general problem for arbitrary precedence relations? Can we build approximation schemes? How can we help the user find a good trade-off between reliability and makespan? All these questions will be addressed in this article. More precisely, we show why both objectives are contradictory and how to provide approximations of the Pareto-front(1) in the case of independent tasks and task graphs (with the special case of chains of tasks). The main goal of this paper is to provide a deep understanding of the bi-criteria problem (makespan vs. reliability) we study, as well as different ways to tackle the problem depending on the specificity of the input. The content and the organization of this paper are as follows. In section 2.1, we introduce the definitions of reliability and makespan and some related notation. In section 2.2, we present and discuss the most significant related works. In section 2.3, we study some basic characteristics of the bi-objective problem. In particular, we show that maximizing the reliability is a polynomial problem (Proposition 1): the optimum is simply obtained by executing the application sequentially on the processor that has the smallest product of {failure rate} and {unitary instruction execution time}. This means that minimizing the makespan is contradictory with the objective of maximizing the reliability. Furthermore, we show that, in the general case, approximating both objectives simultaneously is not possible (Proposition 2).
(1) Intuitively, the Pareto-front is the set of best compromise solutions, any absolutely better solution being infeasible.

We show
that the number of points of the Pareto-front in the case of independent tasks can be exponential (Theorem 2), and hence that it is necessary to be able to approximate it. In section 4.2, we study the problem of scheduling a set of independent unitary tasks (i.e., tasks of the same length). For this case, we propose an optimal algorithm (Algorithm 3) for maximizing the reliability subject to makespan minimization. We also propose a (1+ɛ, 1)-approximation of the Pareto-front (Section 4.2.2). This means that we can provide a set of solutions of polynomial size that approximates, within a constant ratio, all the optimal makespan/reliability trade-offs. In section 4.3, we study the case of independent tasks of arbitrary length. We provide a ⟨2, 1⟩-approximation algorithm (Algorithm 4), i.e., for any fixed value of the makespan, the obtained solution is optimal on the reliability and no more than twice the given makespan, and we derive a Pareto-front approximation from this algorithm (Section 4.3.2). An experimental evaluation of this algorithm is provided in Section 4.4. All the above solutions emphasize the importance of the {failure rate} × {unitary instruction execution time} product. Based on this observation, we show, in section 5.1, how to easily transform a heuristic that targets makespan minimization into a bi-objective heuristic for the case of arbitrary precedence relations (Algorithm 5). In this case also, we demonstrate how to help users choose a suitable makespan/reliability trade-off. We implement this methodology using two heuristics and compare our approach against other heuristics of the literature. Moreover, in section 5.2, we study a special sub-case of precedence task graphs where all the tasks are serialized by a chain (Lemma 4). Finally, we conclude the paper and discuss some challenging perspectives.

2. Preliminaries

2.1. Problem Definition

As in most related studies, a parallel application is represented by a precedence task graph: let G = (T, E) be a Directed Acyclic Graph (DAG) where T is the set of n vertices (that represent the tasks) and E is the set of edges that represent precedence constraints among the tasks (if there are any). Let Q be a set of m uniform processors as described in [6]. A uniform processor is defined as follows: processor j computes 1/τ_j operations per time unit, and p_ij = p_i·τ_j denotes the running time of task i on processor j (τ_j is also called the unitary instruction execution time, i.e., the time to perform one operation). In the remainder of the paper, i will denote a task index while j will refer to a processor. p_i denotes the processing requirement of task i. Moreover, processor j has a constant failure rate λ_j. When a processor is affected by a failure, it stops working until the end of the schedule (this model is usually called crash fault). If a processor fails before completing the execution of all its tasks, the execution has failed.

A schedule s = (π, σ) is composed of two functions: a function π : T → Q that maps a task to the processor that executes it, and a function σ : T → R that associates to each task the time when it starts its execution. We denote by π^{-1} the function which maps a processor to the set of tasks allocated to it, which we improperly call the inverse of function π. To be valid, a schedule must satisfy the precedence constraints, and no processor may execute more than one task at once. The completion time of processor j is the first time when all its tasks are completed: C_j(s) = max_{i∈π^{-1}(j)} (σ(i) + p_ij). The makespan of a schedule is defined as the maximum completion time: C_max(π) = max_j C_j(π). The probability that processor j executes all its tasks successfully is given by an exponential law: Pr_succ^j(π) = e^{-λ_j·C_j(π)}. We assume that faults are independent; therefore, the probability that schedule π finishes correctly is Pr_succ(π) = Π_j Pr_succ^j(π) = e^{-Σ_j C_j(π)·λ_j}. The reliability index is defined by rel(π) = Σ_j C_j(π)·λ_j; minimizing it maximizes the probability of success. When no confusion is possible, π will be omitted. We are interested in minimizing both C_max and rel simultaneously (i.e.,
minimizing the makespan and maximizing the probability of success of the whole schedule).

2.2. Related Works

Optimizing single objectives. First, we briefly discuss how each single-objective problem has been studied in the literature. The minimization of the makespan is a classical problem. It is well known that deciding whether a set of independent tasks can be scheduled on uniform processors in less than a fixed amount of time is NP-complete, because it contains the NP-complete problem PARTITION as a sub-problem [7]. A low-cost (2 - 1/(m+1))-approximation algorithm has been proposed in [8]. It consists of classical list scheduling where the longest task of the list is iteratively mapped on the processor that will complete it the soonest. Hochbaum and Shmoys proposed a PTAS (Polynomial Time Approximation Scheme) based on the bin packing problem with variable bin sizes [9]. However, this result is only of theoretical interest because its runtime complexity is far too high. The problem with precedence constraints is much less understood from the approximation theory point of view. Without communication delays, the best known approximation algorithm for arbitrary dependency graphs and uniform processors is an O(log m)-approximation proposed by [10]. The problem with communication delays is known to be difficult even on identical processors, and often requires distinguishing between small and large communication delays, or hypotheses such as Unitary Execution Tasks [11]. It is beyond the scope of this paper to give a full review of scheduling theory; the reader is referred to [12] for more details.

Since there exist many reliability models, there exist multiple methods to optimize the reliability, depending on the chosen model. Some models make determining the maximum reliability harder: the main problem is to avoid dependent probabilistic events, which prevent the existence of a useful closed formula and often arise in schedules with replication. For instance, [13] needs to add constraints on the structure of the schedule to be able to compute the reliability in polynomial time; without this restriction, determining the reliability of a schedule would be a difficult problem [14]. We consider in this work a realistic model where the schedule with the best reliability can be computed in polynomial time (of course, the corresponding makespan may be very large).
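To make the definitions of section 2.1 and the list-scheduling heuristic of [8] concrete, here is a minimal Python sketch (the instance values and function names are ours, for illustration only): it greedily maps the longest remaining task to the processor that would complete it soonest, then evaluates the makespan C_max and the reliability index rel = Σ_j C_j·λ_j.

```python
import math

def list_schedule(p, tau):
    """List scheduling on uniform processors, in the spirit of [8]:
    iteratively map the longest remaining task to the processor that
    would complete it the soonest."""
    C = [0.0] * len(tau)          # current completion time per processor
    pi = {}                       # task index -> processor index
    for i in sorted(range(len(p)), key=lambda i: -p[i]):
        j = min(range(len(tau)), key=lambda j: C[j] + p[i] * tau[j])
        pi[i] = j
        C[j] += p[i] * tau[j]     # p_ij = p_i * tau_j
    return pi, C

# Hypothetical instance: p_i = processing requirements,
# tau_j = unit instruction times, lam_j = failure rates.
p = [4.0, 3.0, 2.0, 2.0]
tau = [1.0, 1.0]
lam = [1e-4, 4e-4]

pi, C = list_schedule(p, tau)
cmax = max(C)                                     # makespan C_max
rel = sum(Cj * lj for Cj, lj in zip(C, lam))      # rel = sum_j C_j * lambda_j
p_succ = math.exp(-rel)                           # Pr_succ = e^{-rel}
```

On this toy instance the heuristic happens to return an optimal makespan of 6; note that rel depends on which processor receives the load, not only on the makespan.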
The assumption of crash faults is realistic in the sense that it corresponds to the most common type of failure: a machine going offline. The assumption that the probability of success follows an exponential law is a direct consequence of the assumption that the failure rate is constant during the execution of the application. This assumption is reasonable since the execution time of the application is small compared to the lifetime of the cluster. Moreover, this assumption is the basis of the Shatz-Wang reliability model [15], which has been used in numerous works on reliability such as [13, 3, 4, 5]. Finally, some authors have studied non-conventional objectives
like maximizing the number of tasks performed before failure [16].

Related bi-objective problems. Shmoys and Tardos studied the bi-objective problem of minimizing both the makespan and the sum of costs of a schedule of independent tasks on unrelated machines [17]. This problem is mathematically the same as the problem of optimizing the makespan and reliability of independent tasks. In their model, a cost is induced by scheduling a task on a processor, and the cost function is given by a cost matrix. They proposed an algorithm that takes two parameters, namely a target value M for the makespan and C for the cost, and returns a schedule whose makespan is lower than 2M with a cost better than C. This method can be adapted to solve our problem. However, it is difficult to implement since it relies on Linear Programming, and its complexity, in O(mn^2 log n), is costly. Section 4.3 will present an algorithm tailored to our case of uniform machines that is asymptotically faster by a ratio of O(nm). It is also possible to use integrated approaches where one of the objectives implicitly contains the other, like the minimization of the mean makespan with check-points [18]; here, the trade-off between doing a check-point or not is included in the expression of the mean makespan.

Optimizing both makespan and reliability. Several heuristics have been proposed to solve this bi-objective problem. Dogan and Özgüner proposed in [3] a bi-objective heuristic called RDLS. In [4], the same authors improved their solution using an approach based on genetic algorithms. In [5], Hakem and Butelle proposed a bi-objective heuristic called BSA that outperforms RDLS. In [19], the authors proposed MRCD, an algorithm that computes makespan/reliability compromises. They show that these compromises can be better than the ones found by other heuristics but, contrary to this work, they do not focus on the whole Pareto-front.
All these results focused on the general case where the precedence task graph is arbitrary. Moreover, none of the proposed heuristics has a constant approximation ratio.

This manuscript is an extended version of two works on this topic: [20] and [21]. On the theoretical side, we prove here that the Pareto-front can be exponential, and we study the case of chains of tasks, for which we propose an optimal algorithm. On the experimental side, we have added a substantial amount of work concerning the experimental validation of our algorithm for independent arbitrary
tasks and for the different heuristics studied in the case of arbitrary task graphs.

2.3. Preliminary Analysis

The goal of our work is to solve a bi-objective problem, namely minimizing the makespan and maximizing the reliability (which corresponds to minimizing the probability of failure). Unfortunately, these objectives are conflicting. More precisely, as shown in Proposition 1 below, the optimal reliability is obtained by mapping all the tasks on the processor j such that j = argmin_j (τ_j·λ_j), i.e., on the processor for which the product {failure rate} × {unitary instruction execution time} is minimal. However, from the point of view of the makespan, such a schedule can be arbitrarily far from the optimal one.

Proposition 1. Let S* be the schedule where all the tasks have been assigned, in topological order, to a processor j_0 such that τ_{j0}·λ_{j0} is minimal. Let rel* be the reliability index of schedule S*. Then, any schedule S' ≠ S*, with reliability index rel', is such that rel' ≥ rel*.

Proof. Suppose without loss of generality that j_0 = 0 (i.e., ∀j : τ_0 λ_0 ≤ τ_j λ_j). Then rel* = C_0 λ_0 (all the tasks are mapped to processor 0). Call C'_j the completion date of the last task on processor j in schedule S'. Therefore, rel' ≥ Σ_{j=0}^{m} C'_j λ_j (the inequality comes from the idle times that may appear; they can be omitted here since this only decreases the bound on rel', and a lower bound is enough for our calculations). Let T' be the set of tasks that are not executed on processor 0 by schedule S'. Then, C'_0 ≥ C_0 - τ_0 Σ_{i∈T'} p_i (the tasks of T \ T' are still executed on processor 0). Let T' = T'_1 ∪ T'_2 ∪ ... ∪ T'_m, where T'_j is the subset of the tasks of T' executed on processor j by schedule S' (these sets are disjoint: ∀j_1 ≠ j_2, T'_{j1} ∩ T'_{j2} = ∅). Then, ∀j, 1 ≤ j ≤ m, C'_j ≥ τ_j Σ_{i∈T'_j} p_i. Let us compute the difference rel' - rel*:

rel' - rel* ≥ Σ_{j=0}^{m} C'_j λ_j - C_0 λ_0
            ≥ C'_0 λ_0 + Σ_{j=1}^{m} (τ_j λ_j Σ_{i∈T'_j} p_i) - C_0 λ_0
            ≥ (C_0 - τ_0 Σ_{i∈T'} p_i) λ_0 + Σ_{j=1}^{m} (τ_j λ_j Σ_{i∈T'_j} p_i) - C_0 λ_0
            = Σ_{j=1}^{m} (τ_j λ_j Σ_{i∈T'_j} p_i) - τ_0 λ_0 Σ_{i∈T'} p_i
            = Σ_{j=1}^{m} ((τ_j λ_j - τ_0 λ_0) Σ_{i∈T'_j} p_i)    (because the T'_j are disjoint)
            ≥ 0    (because ∀j : τ_0 λ_0 ≤ τ_j λ_j)

This proposition shows that the problem of minimizing the makespan subject to the condition that the reliability is maximized corresponds to the problem of minimizing the makespan using only processors having a minimal τ_j·λ_j. If there is only one such processor, the problem is straightforward: the reliability is maximized only if all the tasks are sequentially executed on this processor. However, when several processors have the same minimal λ_j·τ_j value, the problem is NP-hard since it requires minimizing the makespan on all of these processors.

The following proposition proves that, for the problem we are interested in, there is no solution of the bi-objective problem that is simultaneously close to the optimum on both objectives.

Proposition 2. The bi-objective problem of minimizing C_max and rel cannot be approximated within a constant factor by a single solution.

Proof. Consider the class of instances I_k of the problem with two machines such that τ_1 = 1, τ_2 = 1/k and λ_1 = 1, λ_2 = k^2 (k ∈ R+), and a single task t_1 with p_1 = 1. There exist only two feasible schedules, namely π_1, in which t_1 is scheduled on processor 1, and π_2, in which it is scheduled on processor 2. Remark that π_2 is optimal for C_max and that π_1 is optimal for rel.
C_max(π_1) = 1 and C_max(π_2) = 1/k. This leads to C_max(π_1)/C_max(π_2) = k, a ratio that goes to infinity when k goes to infinity. Similarly, rel(π_1) = 1 and rel(π_2) = (1/k)·k^2 = k, which leads to rel(π_2)/rel(π_1) = k. Again, this ratio goes to infinity with k. None of these feasible schedules can approximate both objectives within a constant factor.

Proposition 2 shows that the problem of optimizing both objectives simultaneously cannot be approximated. That is to say, in general, there exists no solution which is close to the optimal value on both objectives at the same time. Therefore, we will tackle the problem as optimizing one objective subject to the condition that the second one is kept at a reasonable value ([22] Chap. 3, pp. 12). For our problem, it corresponds to maximizing the reliability subject to the condition that the makespan is under a threshold value. This approach may be seen as giving the priority to the makespan (the most difficult objective to optimize) and optimizing the reliability as a secondary goal. However, since finding the optimal makespan is usually NP-hard, we aim first at designing an approximation algorithm and then at determining an approximation of the Pareto-front. As the number of Pareto-optimal solutions can be exponential, it is important to be able to generate an approximation of the Pareto-front that has a polynomial size. To achieve this goal, we use the methodology proposed by Papadimitriou and Yannakakis in [23]. It is recalled briefly in the next section and will be used in section 4 for the case of independent tasks.

3. Bi-objective Approximation

In bi-objective optimization there is no concept of absolute best solution. In general, no solution is the best on both objectives. However, a given solution may be better than another one on both objectives. It is then said that the former Pareto-dominates the latter.
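Pareto domination, and the extraction of the non-dominated points from a set of candidate schedules, can be sketched as follows for our two minimized objectives (a naive quadratic filter; the names and sample points are ours, for illustration):

```python
def dominates(a, b):
    """a = (cmax, rel) Pareto-dominates b if it is at least as good on
    both minimized objectives and differs from b."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

def pareto_front(points):
    """Keep only the non-dominated (Pareto-optimal) points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Toy (C_max, rel) pairs: (3.0, 6.0) is dominated by (2.0, 5.0),
# all other points are mutually incomparable.
pts = [(1.0, 9.0), (2.0, 5.0), (3.0, 6.0), (4.0, 1.0)]
front = pareto_front(pts)
```

This brute-force filter is only usable on explicitly enumerated solution sets; the point of sections 3 and 4 is precisely to avoid such enumeration.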
The interesting solutions in bi-objective optimization, called Pareto-optimal solutions, are those that are not dominated by any other solution. The Pareto-front (also called Pareto-set) of an instance is the set of all Pareto-optimal solutions.

Figure 1: Bold crosses are a (ρ_1, ρ_2)-approximation of the Pareto-front.

Intuitively, the Pareto-front divides the solution space between feasible and unfeasible solutions. It is the set of interesting compromise solutions, and determining this set is the main target of multi-objective optimization. Unfortunately, this set is most of the time difficult to compute, because one of the underlying optimization problems is NP-hard or because its cardinality is exponential. In our case, both reasons stand(2). Thus, we look for an approximation of the Pareto-front with a polynomial cardinality. A generic method to obtain an approximated Pareto-front was introduced by Papadimitriou and Yannakakis in [23]. A set P_c is a (ρ_1, ρ_2)-approximation of the Pareto-front P*_c if each solution s* ∈ P*_c is (ρ_1, ρ_2)-approximated by a solution s ∈ P_c: ∀s* ∈ P*_c, ∃s ∈ P_c, C_max(s) ≤ ρ_1·C_max(s*) and rel(s) ≤ ρ_2·rel(s*). Fig. 1 illustrates this concept. Crosses are solutions of the scheduling problem represented in the (C_max; rel) space. The bold crosses are an approximated Pareto-front. Each solution (x; y) in this set (ρ_1, ρ_2)-dominates a quadrant delimited in bold in the figure, whose origin is at (x/ρ_1; y/ρ_2). All solutions are dominated by a solution of the approximated Pareto-front, as they are included in a (ρ_1, ρ_2)-dominated quadrant.

(2) We will show in the next section that the size of the Pareto-front can be exponential.

One possible way for building such an approximation uses an algorithm that constructs
a ρ_2-approximation of the second objective constrained by a threshold on the first one. The threshold cannot be exceeded by more than a constant factor ρ_1. Such an algorithm is said to be a ⟨ρ_1, ρ_2⟩-approximation algorithm. More formally,

Definition 1. Given a threshold value of the makespan ω, a ⟨ρ_1, ρ_2⟩-approximation algorithm delivers a solution such that C_max ≤ ρ_1·ω and rel ≤ ρ_2·rel*(ω), where rel*(ω) is the best possible value of the reliability index over schedules whose makespan is less than ω.

Let APPROX be a ⟨ρ_1, ρ_2⟩-approximation algorithm (for instance, Algorithm 3 or Algorithm 4, presented later). Algorithm 1 constructs a (ρ_1+ɛ, ρ_2)-approximation of the Pareto-front of the problem by applying APPROX on a geometric sequence of makespan thresholds. The geometric sequence is only considered between a lower bound C_max^min and an upper bound C_max^max on the makespan of Pareto-optimal solutions.

Algorithm 1: Pareto-front approximation (according to the method of Papadimitriou and Yannakakis)
Data: ɛ a positive real number
Result: S a set of solutions
begin
    k ← 0
    S ← ∅
    while k ≤ log_{1+ɛ/ρ_1}(C_max^max / C_max^min) do
        ω_k ← (1 + ɛ/ρ_1)^k · C_max^min
        s_k ← APPROX(ω_k)
        S ← S ∪ {s_k}
        k ← k + 1
    return S
end

Theorem 1. The method of Papadimitriou and Yannakakis described in Algorithm 1 builds a (ρ_1+ɛ, ρ_2)-approximation of the Pareto-front from a ⟨ρ_1, ρ_2⟩-approximation algorithm.
Figure 2: APPROX(ω_{k+1}) is a (ρ_1+ɛ, ρ_2)-approximation of the Pareto-optimal solutions whose makespan is between ω_k and ω_{k+1}. There is at most a factor of ρ_2 on the reliability between APPROX(ω_{k+1}) and rel*(ω_{k+1}). The ratio on the makespan between APPROX(ω_{k+1}) and ω_{k+1} is less than ρ_1, and ω_{k+1} = (1 + ɛ/ρ_1)·ω_k. Thus, APPROX(ω_{k+1}) is a (ρ_1+ɛ, ρ_2)-approximation of (ω_k, rel*(ω_{k+1})).

Proof. Let s* be a Pareto-optimal schedule. Then, there exists k ∈ N such that (1 + ɛ/ρ_1)^k · C_max^min ≤ C_max(s*) ≤ (1 + ɛ/ρ_1)^{k+1} · C_max^min. We show that s_{k+1} is a (ρ_1+ɛ, ρ_2)-approximation of s*. The construction from step k to step k+1 is illustrated in Figure 2.

Reliability. By definition, rel(s_{k+1}) ≤ ρ_2 · rel*((1 + ɛ/ρ_1)^{k+1} · C_max^min). Since s* is Pareto-optimal, rel(s*) = rel*(C_max(s*)). But C_max(s*) ≤ (1 + ɛ/ρ_1)^{k+1} · C_max^min, and rel* is a decreasing function; hence rel(s_{k+1}) ≤ ρ_2 · rel(s*).

Makespan. C_max(s_{k+1}) ≤ ρ_1 · (1 + ɛ/ρ_1)^{k+1} · C_max^min = (ρ_1+ɛ)(1 + ɛ/ρ_1)^k · C_max^min (by definition), and C_max(s*) ≥ (1 + ɛ/ρ_1)^k · C_max^min. Thus, C_max(s_{k+1}) ≤ (ρ_1+ɛ) · C_max(s*).

Remark that APPROX(ω_k) may not return a solution (in this case s_k is set to ⊥ and we increment k). However, this is not a problem, because it means that no solution has a makespan lower than ω_k; APPROX(ω_k) approximates the Pareto-optimal solutions whose makespan is lower than ω_k, hence no solution is forgotten. The algorithm generates log_{1+ɛ/ρ_1}(C_max^max / C_max^min) solutions and calls the APPROX algorithm the same number of times.
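Algorithm 1 translates directly into code: sweep the threshold geometrically from C_max^min to C_max^max and collect one APPROX solution per threshold. In this sketch the APPROX oracle is a stand-in (any ⟨ρ_1, ρ_2⟩-approximation algorithm, such as Algorithm 3 or 4, can be plugged in); the parameter values are ours.

```python
def pareto_front_approx(approx, cmax_min, cmax_max, eps, rho1):
    """Algorithm 1: call APPROX on the geometric sequence of thresholds
    omega_k = (1 + eps/rho1)^k * cmax_min, k = 0 .. log_q(cmax_max/cmax_min),
    where q = 1 + eps/rho1."""
    q = 1.0 + eps / rho1
    S = []
    omega = cmax_min
    while omega <= cmax_max:       # equivalent to the k <= log_q(...) test
        s = approx(omega)          # may return None if no schedule fits omega
        if s is not None:
            S.append(s)
        omega *= q
    return S

# Stand-in APPROX oracle: pretend the best schedule under threshold w
# is the point (w, 1/w) in the (C_max, rel) space.
front = pareto_front_approx(lambda w: (w, 1.0 / w),
                            cmax_min=1.0, cmax_max=16.0, eps=1.0, rho1=1.0)
```

With eps = rho1 = 1 the thresholds double at each step (1, 2, 4, 8, 16), so the returned set has logarithmic, hence polynomial, size as Theorem 1 requires.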
4. Independent tasks

4.1. Size of the Pareto-front

Before proposing algorithmic solutions for the bi-objective problem, we show that it is not possible to compute the whole Pareto-front in polynomial time. More precisely, we show that the number of points of the Pareto-front can be exponential in the size of the input.

Theorem 2. There exists a class of instances whose set of Pareto-optimal solutions is exponential in the number of tasks.

Proof. The proof is obtained by exhibiting a class of instances with an exponential number of Pareto-optimal solutions. Let us consider the instance I_n composed of n tasks such that p_i = 2^{i-1} for all i, 1 ≤ i ≤ n, and 2 processors, where the first one is very fast but unreliable (τ_1 = 2^{-n}, λ_1 = 1) whereas the second one is very slow but highly reliable (τ_2 = 1, λ_2 = 2^{-2n}). The processor parameters and task sizes induce that:

- The makespan is determined by the tasks scheduled on processor 2: C_max = Σ_{i∈π^{-1}(2)} p_i (or is equal to Σ_{i=1}^{n} 2^{i-1}·τ_1 = (2^n - 1)/2^n ≤ 1 if all the tasks are scheduled on processor 1).
- The reliability index is mainly determined by the tasks scheduled on processor 1: rel ≈ 2^{-n} Σ_{i∈π^{-1}(1)} p_i (the contribution of the tasks on processor 2 is less than (2^n - 1)·2^{-2n} and can thus be omitted for the sake of clarity).

There are exactly 2^n solutions, since each task may be scheduled either on processor 1 or on processor 2. Each solution is uniquely described by the sum of the processing times of the tasks scheduled on processor 2, which can take all the values between 0 and 2^n - 1. From the above, let π_i be the schedule with a makespan of C_max = i; its reliability index is rel = 2^{-n}(2^n - 1 - i). All the solutions have different objective values. Moreover, the makespan strictly increases with i whereas rel strictly decreases. This proves that each solution is Pareto-optimal.
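For a small n, the instance of Theorem 2 can be enumerated exhaustively to check that all 2^n schedules are indeed mutually non-dominated. This brute-force sketch assumes the instance parameters as we read them (τ_1 = 2^-n, λ_1 = 1, τ_2 = 1, λ_2 = 2^-2n); the variable names are ours.

```python
from itertools import product

n = 4
p = [2 ** i for i in range(n)]        # p_i = 2^{i-1}: 1, 2, 4, 8
tau = [2.0 ** -n, 1.0]                # processor 1 fast, processor 2 slow
lam = [1.0, 2.0 ** (-2 * n)]          # processor 1 unreliable, processor 2 reliable

sols = []
for pi in product((0, 1), repeat=n):  # each task goes to processor 1 or 2
    C = [0.0, 0.0]
    for i, j in enumerate(pi):
        C[j] += p[i] * tau[j]
    cmax = max(C)
    rel = C[0] * lam[0] + C[1] * lam[1]
    sols.append((cmax, rel))

# All 2^n objective vectors are distinct and none dominates another.
dominated = any(a != b and a[0] <= b[0] and a[1] <= b[1]
                for a in sols for b in sols)
assert not dominated and len(set(sols)) == 2 ** n
```

All arithmetic here is on dyadic rationals, so the floating-point comparisons are exact for this instance.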
4.2. Independent unitary tasks

Notice that when we consider only independent tasks, all the solutions are compact (i.e., they do not contain idle time) and the order of the tasks does not matter. Therefore, a solution for independent unitary tasks is entirely defined by the number of tasks allocated to each processor.

4.2.1. A ⟨1, 1⟩-approximation algorithm

Given a makespan objective ω, we show how to find a task allocation that is the most reliable for a set of n independent unitary tasks (∀i ∈ T, p_i = 1). To build a ⟨ρ_1, ρ_2⟩-approximation algorithm, we consider the problem of minimizing the probability of failure subject to the condition that the makespan is constrained. Since the tasks are unitary and independent, the problem is to find, for each processor j ∈ Q, the number of tasks a_j to allocate to processor j such that the following constraints are fulfilled:

(1) Σ_{j∈Q} a_j = n.
(2) The makespan is constrained: ∀j ∈ Q, a_j·τ_j ≤ ω. This threshold ω on the makespan is assumed to be larger than the optimal makespan C_max^*.
(3) Subject to the previous constraints, rel is minimized, i.e., Σ_{j∈Q} a_j·λ_j·τ_j is minimized.

Once the allocation is known, it is easy to express a solution π such that |π^{-1}(j)| = a_j. First, it is important to notice that a schedule whose makespan is smaller than a given objective ω can be found in polynomial time. Indeed, Algorithm 2 determines the minimal-makespan allocation for any given set of independent unitary tasks, as shown in [24]. Second, we propose Algorithm 3 to solve the problem. It determines an optimal allocation, as proven by Theorem 3. It is a greedy algorithm that allocates the tasks to the processors in increasing order of their λ_j·τ_j products. Each processor receives the largest possible number of tasks while keeping the makespan less than ω.

Theorem 3. Algorithm 3 is a ⟨1, 1⟩-approximation.

Proof. Let X be the number of tasks already assigned.
Since when X < n we allocate at most n - X tasks to a processor, at the end of the algorithm we have X ≤ n (since ω ≥ C_max^*,
Algorithm 2: Optimal allocation for independent unitary tasks
begin
    for j from 1 to m do
        a_j ← ⌊n · (1/τ_j) / (Σ_i 1/τ_i)⌋
    while Σ_j a_j < n do
        k ← argmin_l (τ_l · (a_l + 1))
        a_k ← a_k + 1
end

Algorithm 3: Optimal reliable allocation for independent unitary tasks
Input: ω ≥ C_max^*
begin
    Sort the processors by increasing λ_j·τ_j
    X ← 0
    for j from 1 to m do
        if X < n then
            a_j ← min(n - X, ⌊ω/τ_j⌋)
        else
            a_j ← 0
        X ← X + a_j
end
at the end of the algorithm X = n, i.e., all the tasks are assigned). For each processor j we allocate at most ⌊ω/τ_j⌋ tasks, hence the makespan constraint is respected: a_j·τ_j ≤ ω. Since in Algorithm 2 neither the order of the tasks nor the order of the processors is taken into account, Algorithm 3 is valid (i.e., all the tasks are assigned using at most the m processors). Hence, the makespan of the schedule is lower than ω.

We now show that Σ_{j∈Q} a_j·λ_j·τ_j is minimum. First, remark that Algorithm 3 fills the processors in increasing order of their λ_j·τ_j values, giving each the maximum load allowed by ω. Hence, any other valid allocation a' can only be obtained by moving some tasks from processors with small λ_j·τ_j to processors with larger λ_j·τ_j. Without loss of generality, assume that a'_1 = a_1 - k, a'_i = a_i + k, and a'_j = a_j for some k ∈ N, 1 ≤ k ≤ a_1, and every j ≠ 1, j ≠ i. Then, the difference between the two objective values is

D = Σ_{x∈Q} a'_x·λ_x·τ_x - Σ_{x∈Q} a_x·λ_x·τ_x
  = λ_1·τ_1·(a'_1 - a_1) + λ_i·τ_i·(a'_i - a_i)
  = -k·λ_1·τ_1 + k·λ_i·τ_i
  = k·(λ_i·τ_i - λ_1·τ_1) ≥ 0

because λ_i·τ_i ≥ λ_1·τ_1. Hence, the allocation computed by Algorithm 3 has the smallest objective value.

4.2.2. Approximating the Pareto-front

We propose below two methods for computing the Pareto-front based on Algorithm 3. The first technique consists in using the method of Papadimitriou and Yannakakis presented in Algorithm 1. Since Algorithm 3 is a ⟨1, 1⟩-approximation algorithm, we obtain a (1+ɛ, 1)-approximation of the Pareto-front thanks to Theorem 1. In this case, the lower bound C_max^min = C_max^* is computed by Algorithm 2, and the upper bound C_max^max = n·τ_1 is the makespan obtained when all the tasks are executed on the processor that leads to the most reliable schedule (longer schedules are Pareto-dominated by this one). The time-complexity of this method is in O(m·log_{1+ɛ}(n·τ_1)), which is polynomial.
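Algorithm 3 is short enough to transcribe directly. The sketch below follows its greedy structure (sort by λ_j·τ_j, then fill each processor up to the threshold); the function name and instance values are ours, for illustration.

```python
def reliable_allocation(n, tau, lam, omega):
    """Algorithm 3 sketch: most reliable allocation of n unit tasks under
    makespan threshold omega.  Returns the a_j vector (indexed like tau),
    or None when omega is below the optimal makespan (no feasible allocation)."""
    order = sorted(range(len(tau)), key=lambda j: lam[j] * tau[j])
    a = [0] * len(tau)
    assigned = 0
    for j in order:                            # increasing lambda_j * tau_j
        if assigned < n:
            a[j] = min(n - assigned, int(omega / tau[j]))   # floor(omega/tau_j)
            assigned += a[j]
    return a if assigned == n else None

# Two processors: processor 0 has lambda*tau = 0.1, processor 1 has 0.4,
# so processor 0 is filled first, up to floor(omega / tau_0) tasks.
a = reliable_allocation(n=10, tau=[1.0, 0.5], lam=[0.1, 0.8], omega=6.0)
```

Here processor 0 receives ⌊6/1⌋ = 6 tasks and processor 1 the remaining 4, and both completion times (6 and 2) respect the threshold ω = 6.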
The second method consists in calling Algorithm 3 only on relevant values of ω. It leads to the question: what is the smallest value ω′ > ω that produces a different schedule? ω′ must be large enough to allow one task scheduled on processor j to be scheduled on some processor j′ < j instead, improving the reliability. Therefore, only the values ω = xτ_j are interesting; they correspond to the execution time of x (1 ≤ x ≤ n) tasks on processor j (1 ≤ j ≤ m). There are fewer than nm interesting times and thus fewer than nm Pareto-optimal solutions. Using Algorithm 3, the exact Pareto-front can be found in O(nm²); this time-complexity is exponential in the size of the instance. Indeed, the size of the instance is not n but O(log n): we only need to encode the value of n, not the n tasks, as they are all identical.

Independent arbitrary tasks

In this section, we extend the analysis to the case where the tasks are not unitary (the values p_i are integers). As before, the makespan objective is fixed and we aim at determining the best possible reliability. However, since deciding whether there exists a schedule whose makespan is smaller than a target value, given a set of processors and arbitrary independent tasks, is NP-complete, it is not possible to find an optimal schedule unless P=NP.

A ⟨2, 1⟩-approximation algorithm

We present below a ⟨2, 1⟩-approximation algorithm called CMLT (for ConstrainedMinLambdaTau) which has a better complexity and is easier to implement than the general algorithm presented in [17]. Let ω be the guess value of the optimum makespan. Let M(i) = {j | p_i τ_j ≤ ω} be the set of processors able to execute task i in at most ω units of time. It is obvious that if i is executed on some j ∉ M(i), then the makespan will be greater than ω. The following proposition states that if task i has fewer operations than task i′, then all the machines able to schedule i′ in at most ω time units can also schedule i within the same time.
Proposition 3. ∀ i, i′ ∈ T such that p_i ≤ p_{i′}, M(i′) ⊆ M(i).

The proof is directly derived from the definition of M and is thus omitted.
CMLT proceeds as follows: for each task i, considered in non-increasing number of operations, schedule i on the processor j of M(i) that minimizes λ_j τ_j among those with C_j ≤ ω (or return "no schedule" if there is no such processor). Sorting the tasks by non-increasing number of operations implies that more and more processors are used over time. The principle of the algorithm is rather simple; however, several properties must be verified to ensure that it is always possible to schedule all the tasks this way.

Lemma 1. CMLT returns a schedule whose makespan is lower than 2ω, or ensures that there is no schedule whose makespan is lower than ω.

Proof. First remark that if the algorithm returns a schedule, then its makespan is lower than 2ω (task i is executed on a processor j ∈ M(i) only when C_j ≤ ω). It remains to prove that if the algorithm does not return a schedule, then there is no schedule with a makespan lower than ω. Suppose that task i cannot be scheduled on any processor of M(i). Then all processors of M(i) execute tasks during more than ω units of time: ∀j ∈ M(i), C_j > ω. Moreover, due to Proposition 3, each task i′ ≠ i such that p_{i′} ≥ p_i could not have been scheduled on a processor not belonging to M(i). Thus, in a schedule with a makespan lower than ω, all the tasks i′ with p_{i′} ≥ p_i must be scheduled on M(i). But there are more operations in this set of tasks than the processors of M(i) can execute in ω units of time; hence no schedule with a makespan lower than ω exists.

Lemma 2. CMLT generates a schedule such that rel(CMLT) ≤ rel*(ω).

Proof. We first construct a (possibly non-feasible) schedule π* whose reliability is a lower bound of rel*(ω). Then, we show that rel(CMLT) ≤ rel(π*). We know from Theorem 3 that the optimal reliability under the makespan constraint for unitary tasks and homogeneous processors is obtained by adding tasks to the processors (sorted in increasing order of λτ) up to reaching the threshold ω. For arbitrary lengths, we can construct a schedule π* using a similar method.
Task i is allocated to the processor of M(i) that minimizes the λτ product. But if i finishes after ω, the exceeding quantity is 19
scheduled on the next processor belonging to M(i) in λτ order. Note that such a schedule exists because CMLT returns a solution. Of course, this schedule is not always feasible, as the same task can be required to execute on more than one processor at the same time. However, it is easy to adapt the proof of Theorem 3 to show that rel(π*) ≤ rel*(ω). The schedule generated by CMLT is similar to π*. The only difference is that some operations are scheduled after ω; in π*, these operations are scheduled on less reliable processors. Thus, the schedule generated by CMLT has a better reliability than π*. Finally, we have rel(CMLT) ≤ rel(π*) ≤ rel*(ω), which concludes the proof.

Remark that if ω is very large, then M(i) = Q for all tasks i, and hence all the tasks are scheduled on the processor which minimizes the λτ product, leading to the most reliable schedule.

Lemma 3. The time complexity of CMLT is in O(n log n + m log m).

Proof. The algorithm should be implemented using a heap, as presented in Algorithm 4. The cost of sorting the tasks is in O(n log n) and the cost of sorting the processors is in O(m log m). Adding a processor to (and removing it from) the heap costs O(log m), and such operations are done m times, so heap operations cost O(m log m). Scheduling a task and all complementary tests are done in constant time, and there are n tasks to schedule, so scheduling operations cost O(n).

All the results of this section are summarized in the following theorem:

Theorem 4. CMLT is a ⟨2, 1⟩-approximation algorithm with a complexity in O(n log n + m log m).

Approximating the Pareto-front

Here again, we can approximate the Pareto-front using the method of Papadimitriou and Yannakakis. Thanks to Theorem 1, Algorithm 1 applied on CMLT leads to a (2+ɛ, 1)-approximation of the Pareto-front.
Algorithm 4: CMLT
Input: ω the makespan threshold
begin
    Sort the tasks in non-increasing p_i order (now, ∀i ∈ [1, n−1], p_i ≥ p_{i+1})
    Sort the processors in non-decreasing τ_j order (now, ∀j ∈ [1, m−1], τ_j ≤ τ_{j+1})
    Let H be an empty heap
    j ← 1
    for i from 1 to n do
        while j ∈ M(i) do
            Add j to H with key λ_j τ_j
            j ← j + 1
        if H.empty() then
            Return "no solution"
        k ← H.min()
        schedule i on k
        C_k ← C_k + p_i τ_k
        if C_k > ω then
            Remove k from H
end
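Algorithm 4 can be sketched in Python as follows (an illustrative implementation, not the authors' code; it uses lazy deletion, discarding over-loaded processors when they surface at the top of the heap, which is equivalent to the explicit removal of Algorithm 4):

```python
import heapq

def cmlt(tasks, procs, omega):
    """CMLT sketch: tasks is a list of operation counts p_i, procs a
    list of (lam, tau) pairs, omega the makespan guess.  Returns a dict
    mapping task index -> processor index, or None when no schedule of
    makespan <= omega exists."""
    order = sorted(range(len(tasks)), key=lambda i: -tasks[i])
    by_speed = sorted(range(len(procs)), key=lambda j: procs[j][1])
    load = [0.0] * len(procs)      # C_j, current completion time per processor
    heap, nxt = [], 0              # heap keyed by lambda*tau
    assign = {}
    for i in order:
        # open every processor fast enough to run task i within omega (M(i))
        while nxt < len(by_speed) and tasks[i] * procs[by_speed[nxt]][1] <= omega:
            j = by_speed[nxt]
            heapq.heappush(heap, (procs[j][0] * procs[j][1], j))
            nxt += 1
        # lazily drop processors already loaded beyond omega
        while heap and load[heap[0][1]] > omega:
            heapq.heappop(heap)
        if not heap:
            return None
        j = heap[0][1]             # smallest lambda*tau with C_j <= omega
        assign[i] = j
        load[j] += tasks[i] * procs[j][1]
    return assign
```

Since each processor enters and leaves the heap at most once, the loop matches the O(n log n + m log m) bound of Lemma 3.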
The lower bound C_max^min = Σ_i p_i / Σ_j 1/τ_j is obtained by considering a single virtual processor that gathers the whole computational power of all the processors. The upper bound C_max^max = (Σ_i p_i) · max_j τ_j is the makespan obtained by scheduling all tasks on the slowest processor. No solution can have a worse makespan without introducing idle times, which are harmful for both objective functions. Notice that C_max^max can be achieved by a Pareto-optimal solution if the slowest processor is also the most reliable one.

The last points to address are the cardinality of the generated set and the complexity of the algorithm.

Cardinality: the algorithm generates fewer than

log_{1+ɛ/2}(C_max^max / C_max^min) = log_{1+ɛ/2}(max_j τ_j · Σ_j 1/τ_j) ≤ log_{1+ɛ/2}(m · max_j τ_j / min_j τ_j)

solutions, which is polynomial in 1/ɛ and in the size of the instance.

Complexity: remark that CMLT sorts the tasks in an order which is independent of ω, so this sorting can be done once for all. Thus, the complexity of the Pareto-front approximation algorithm is O(n log n + log_{1+ɛ/2}(C_max^max / C_max^min) · (n + m log m)).

In Section 2.2 we briefly recalled the work of Shmoys and Tardos [17], done for a different bi-objective problem, which may also be used in our context. Using this method, we can derive a ⟨2, 1⟩-approximation algorithm whose time-complexity is in O(mn² log n), which is larger than the O(n log n + m log m) time-complexity of CMLT. Moreover, for approximating the Pareto-front of the problem with the method previously presented, the algorithm derived from [17] would have a time-complexity of log_{1+ɛ/2}(C_max^max / C_max^min) · O(mn² log n). Unlike CMLT, this algorithm cannot be easily tuned to avoid a significant part of the computations when it is called several times.
Thus, CMLT is significantly better than the algorithm presented in [17], which was established in a more general setting with unrelated processors.

Experimental analysis of CMLT

The goal of this section is to compare the front obtained by the approximation Algorithm 1 applied with CMLT to an idealized virtual front (called F). We intend to show that this
algorithm has not only a very good worst-case guarantee, as shown in Theorem 4, but also a good behavior on average.

Figure 3: The hypervolume is the set of the points that are dominated by a point of the front and that dominate the reference point (in this example, the blue zone). When the two objectives have to be minimized, the hypervolume should be maximized.

More precisely, we use Algorithm 1 with ɛ = 10^-3 applied on CMLT. The obtained result is compared to a front F composed of three points, namely the HEFT [25] schedule (oriented to optimize the makespan), the most reliable schedule (obtained by scheduling all the tasks on the processor with the smallest λτ product), and a fictitious schedule with the same makespan as HEFT and the best reliability. Although one can find a better makespan-centric schedule than the one found by HEFT, the front F is a very good front that dominates all the fronts found by CMLT.

To compare the fronts, we use the hypervolume unary indicator [26] (see Fig. 3), which measures the volume of the objective space dominated by the considered front up to a reference point. This choice is motivated by the fact that this indicator is the only unary indicator that is sensitive to any type of improvement: if a front maximizes this indicator, then it contains all the Pareto-optimal solutions. Since we target a problem of minimizing two objectives, the greater the hypervolume, the better the front [26]. In our case, the hypervolume of F is always a rectangle.
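For a two-dimensional minimization problem such as ours, the hypervolume reduces to a union of rectangles anchored at the reference point. A minimal sketch (illustrative, not the evaluation code used in the experiments; points are (makespan, failure probability) pairs):

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-D minimization front: area of the region
    dominated by the front and dominating the reference point `ref`.
    Sweeps the points left to right, accumulating rectangle strips."""
    pts = sorted(p for p in front if p[0] <= ref[0] and p[1] <= ref[1])
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                       # non-dominated step of the front
            area += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return area
```

For the front {(1, 2), (2, 1)} with reference point (3, 3), the dominated region is the union of two unit-by-two rectangles overlapping in a unit square, of total area 3.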
Figure 4: ECDF and histogram of the hypervolume ratio between the approximation algorithm front and F.

The input cases are the following. We consider three sets of machines with respectively 10, 20 and 50 processors. Speeds and the inverses of the failure rates are randomly generated according to a uniform distribution. We generate sets of tasks with cardinality between 10 and 100 (by increments of 1). For each set of tasks we draw the processing requirements uniformly between 1 and 100 (resp. 10^4, 10^6 and 10^9) for sets of class A (resp. B, C and D). For each set and class of tasks, 4 different seeds were used.

In Fig. 4, we plot the empirical cumulative distribution function (ECDF) and the histogram of the ratio between the hypervolumes of the two fronts for all the input cases (the higher the ratio, the closer the approximation algorithm front is to F). From this figure, we see that the ratio is never lower than 0.6, the median is 0.94, and 2/3 of the cases have a ratio greater than 0.9. This means that the (2+ɛ, 1)-approximation algorithm gives very good fronts on average: in most cases, the obtained fronts are very close to the optimal ones.
5. Precedence Task Graphs

5.1. Arbitrary Graphs

In this section, we study the general case where there is no restriction on the precedence task graph. We present three ways of designing bi-objective heuristics from makespan-centric ones. The first one is based on the characterization of the role of the λτ product, {failure rate} × {unitary instruction execution time}. The second one uses aggregation to change the allocation decision in a list-based makespan-optimizing heuristic. The third one, called geometric, selects the solution that best follows a given direction in the objective space.

The case of communication

When dealing with a regular task graph, edges model communications, so failures of the network can also have an impact on the reliability. We could tackle this problem by considering the network as a new resource, as in [27, 5]. However, a simpler way is to incorporate the network and the CPU into one entity called a node³. As we only consider fail-stop errors, a node has to be up from the start of the application to its end. We assume that each node has a unique dedicated link to a fail-free network backbone. If, for a schedule π, node j is used during C_j(π), this means that the CPU, the network card and the network link must all work. Calling λ^c_j, λ^n_j and λ^l_j the failure rates of the CPU, the network card and the network link of node j, the probability that the three are up is

e^{−λ^c_j C_j(π)} · e^{−λ^n_j C_j(π)} · e^{−λ^l_j C_j(π)} = e^{−(λ^c_j + λ^n_j + λ^l_j) C_j(π)}.

This means the node has a failure rate which is the sum of the failure rates of its CPU, its network card and its network link to the fail-free backbone.
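The identity above can be checked numerically; a one-line sketch with illustrative names:

```python
import math

def node_reliability(lam_cpu, lam_net, lam_link, busy_time):
    """Probability that the CPU, network card and link of a node all
    stay up while the node is busy: exp(-(lc + ln + ll) * C), i.e. the
    product of the three individual exponential reliabilities."""
    return math.exp(-(lam_cpu + lam_net + lam_link) * busy_time)
```

The aggregate failure rate of the node is simply the sum of the three component rates.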
Therefore, in the following, we call λ_j the failure rate of the whole node, in order to take into account both CPU and network failures.

Approximating the Pareto-front Using a Makespan-Centric Heuristic

Both for unitary and non-unitary independent tasks, we have shown that scheduling tasks on the nodes with the smallest λτ helps in improving the reliability. Therefore, in order

³In the remainder, the term node is used to encompass both the CPU and the network card.
to approximate the Pareto-front, we propose a heuristic called GPFA (General Pareto-Front Approximation), detailed in Algorithm 5 below.

Algorithm 5: GPFA, a general heuristic for approximating the Pareto-front
Input: H a makespan-centric heuristic
Data: G the input DAG
Result: S an approximation of the Pareto-front
begin
    Sort the nodes in non-decreasing λ_j τ_j order
    S ← ∅
    for j from 1 to m do
        Let π_j be the schedule of G obtained by H using the first j nodes
        if π_j is not dominated by any solution of S then
            S ← S ∪ {π_j}
    return S
end

The idea is to build a set of makespan/reliability trade-offs by scheduling the tasks on a subset of the nodes (sorted by non-decreasing λτ product) using a makespan-centric heuristic. The smaller the number of used nodes, the larger the makespan and the better the reliability (and vice versa). Any makespan-centric heuristic can be used to implement this strategy, such as HEFT [25], BIL [28], PCT [29], GDL [30], HSA [5] or CPOP [25].

Bi-objective Aggregation-based Heuristic

The class of heuristics based on aggregation uses an additive function to combine the objectives. As in [5], we use the following function. Given a ranking of the tasks, the heuristic schedules task i on the node j such that

α (end(i, j) / max_{j′} end(i, j′))² + (1 − α) (p_i τ_j λ_j / max_{j′} p_i τ_{j′} λ_{j′})²

is minimized, where end(i, j) is the completion time of task i if it is scheduled as soon as possible on node j, and α is a parameter given by the user that determines the trade-off between
each objective (α = 1 leads to a makespan-centric heuristic). Each term represents one of the objectives and is normalized, since the objectives are expressed in different units and can have different orders of magnitude. The normalization is done relatively to an approximation of the worst allocation of the task.

Bi-objective Geometric-based Heuristic

Concerning the geometric class of heuristics, the idea was introduced in [31] and is described below. The user provides an angle θ between 0 and 90 degrees and a greedy scheduling algorithm. Intuitively, θ is the direction in the objective space that the user wants to follow: a value close to 0 means that the user favors the makespan, while a value close to 90 means the opposite. At each step, a partial schedule S has been constructed and a new task is considered. The algorithm simulates its execution on each of the m nodes and hence generates m partial schedules, each one having its own reliability and makespan. Among these schedules, we discard the Pareto-dominated ones. Then, these partial schedules and S, the one generated at the previous step, are plotted into a square of size 1, S being at the origin (see Fig. 5). A line through the origin making an angle θ with the x-axis is drawn. The partial schedule closest to this line is retained (s₂ in the figure), and we proceed to the next step.

Experimental Settings

We compare experimentally the three ways of designing bi-objective heuristics from makespan-centric ones by implementing them on HEFT and HSA. GPFA is used to derive P-HEFT and P-HSA, the aggregation scheme is used to derive B-HEFT and B-HSA⁴, and the geometric construction is used to derive G-HEFT and G-HSA. We have used 3 types of graphs: the Strassen DAG [32] and 2 random graph families, namely samepred (where each created node can be connected to any other existing node) and layrpred (where the nodes are arranged in layers).
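The geometric selection step described above (see Fig. 5) can be sketched as follows in Python (an illustrative implementation, not the authors' code; candidate partial schedules are (makespan, failure probability) pairs, normalized into the unit square relative to the previous partial schedule):

```python
import math

def geometric_pick(candidates, prev, theta_deg):
    """One step of the geometric heuristic: discard Pareto-dominated
    candidates, normalize the rest into a unit square with `prev` at
    the origin, and return the candidate closest to the line through
    the origin at angle theta_deg to the x-axis."""
    nd = [c for c in candidates
          if not any(o[0] <= c[0] and o[1] <= c[1] and o != c
                     for o in candidates)]
    mx = max(c[0] - prev[0] for c in nd) or 1.0   # normalization factors
    my = max(c[1] - prev[1] for c in nd) or 1.0
    t = math.radians(theta_deg)

    def dist(c):
        # distance of the normalized point to the direction line
        x, y = (c[0] - prev[0]) / mx, (c[1] - prev[1]) / my
        return abs(x * math.sin(t) - y * math.cos(t))

    return min(nd, key=dist)
```

With θ close to 0, the line hugs the makespan axis and the candidate with the relatively smallest failure probability increase per unit of makespan is favored, matching the intuition given in the text.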
We have used the following parameters to build the graphs:

- Number of tasks: 10, 100, 1000 for random graphs, or 23, 163, 1143 for Strassen DAGs.

⁴Notice that this heuristic was first proposed by Hakem and Butelle and is called BSA in [5].
Figure 5: The geometric heuristic with 10 nodes: stars are Pareto-optimal solutions and crosses are dominated solutions, which are discarded. Here, partial schedule s₂ is selected and the task is mapped on node 2.

- Average task cost of the p_i's (in FLOP), for random graphs: 10^6, 10^7 or 10^9 (fixed by the structure for Strassen).

- Variation of the task costs: 0.5, , 0.1, 0.3, 1 or 2 for random graphs (fixed by the structure for Strassen). These numbers, combined with the average costs, are used to compute the standard deviation of the Gamma distribution used to draw the task costs (we use a Gamma distribution because it is a positive distribution commonly used to model timings). In this case, the standard deviation is computed by multiplying the average cost by the variation.

- Average communication cost (in bytes): 10^3, 10^4 or 10^6 for random graphs (fixed by the structure for Strassen).

- Variation of the communication costs: 0.5, , 0.1, 0.3, 1 or 2 for random graphs (fixed by the structure for Strassen). Here again, the variation is combined with the average cost to compute the standard deviation of the distribution.
Online Scheduling of Parallel Jobs on Two Machines is 2-Competitive J.L. Hurink and J.J. Paulus University of Twente, P.O. box 217, 7500AE Enschede, The Netherlands Abstract We consider online scheduling
More informationCOMP Analysis of Algorithms & Data Structures
COMP 3170 - Analysis of Algorithms & Data Structures Shahin Kamali Computational Complexity CLRS 34.1-34.4 University of Manitoba COMP 3170 - Analysis of Algorithms & Data Structures 1 / 50 Polynomial
More information34.1 Polynomial time. Abstract problems
< Day Day Up > 34.1 Polynomial time We begin our study of NP-completeness by formalizing our notion of polynomial-time solvable problems. These problems are generally regarded as tractable, but for philosophical,
More informationAS computer hardware technology advances, both
1 Best-Harmonically-Fit Periodic Task Assignment Algorithm on Multiple Periodic Resources Chunhui Guo, Student Member, IEEE, Xiayu Hua, Student Member, IEEE, Hao Wu, Student Member, IEEE, Douglas Lautner,
More informationFriday Four Square! Today at 4:15PM, Outside Gates
P and NP Friday Four Square! Today at 4:15PM, Outside Gates Recap from Last Time Regular Languages DCFLs CFLs Efficiently Decidable Languages R Undecidable Languages Time Complexity A step of a Turing
More informationSpanning and Independence Properties of Finite Frames
Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames
More informationData Gathering and Personalized Broadcasting in Radio Grids with Interferences
Data Gathering and Personalized Broadcasting in Radio Grids with Interferences Jean-Claude Bermond a,b,, Bi Li b,a,c, Nicolas Nisse b,a, Hervé Rivano d, Min-Li Yu e a Univ. Nice Sophia Antipolis, CNRS,
More informationLinear Programming Redux
Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains
More informationCS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding
CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.
More informationDynamic Programming: Shortest Paths and DFA to Reg Exps
CS 374: Algorithms & Models of Computation, Fall 205 Dynamic Programming: Shortest Paths and DFA to Reg Exps Lecture 7 October 22, 205 Chandra & Manoj (UIUC) CS374 Fall 205 / 54 Part I Shortest Paths with
More informationPartitioning Metric Spaces
Partitioning Metric Spaces Computational and Metric Geometry Instructor: Yury Makarychev 1 Multiway Cut Problem 1.1 Preliminaries Definition 1.1. We are given a graph G = (V, E) and a set of terminals
More informationLecture notes for Analysis of Algorithms : Markov decision processes
Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with
More informationCS6999 Probabilistic Methods in Integer Programming Randomized Rounding Andrew D. Smith April 2003
CS6999 Probabilistic Methods in Integer Programming Randomized Rounding April 2003 Overview 2 Background Randomized Rounding Handling Feasibility Derandomization Advanced Techniques Integer Programming
More informationDistributed Optimization. Song Chong EE, KAIST
Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links
More informationThe polynomial solvability of selected bicriteria scheduling problems on parallel machines with equal length jobs and release dates
The polynomial solvability of selected bicriteria scheduling problems on parallel machines with equal length jobs and release dates Hari Balasubramanian 1, John Fowler 2, and Ahmet Keha 2 1: Department
More informationRUN-TIME EFFICIENT FEASIBILITY ANALYSIS OF UNI-PROCESSOR SYSTEMS WITH STATIC PRIORITIES
RUN-TIME EFFICIENT FEASIBILITY ANALYSIS OF UNI-PROCESSOR SYSTEMS WITH STATIC PRIORITIES Department for Embedded Systems/Real-Time Systems, University of Ulm {name.surname}@informatik.uni-ulm.de Abstract:
More informationTask assignment in heterogeneous multiprocessor platforms
Task assignment in heterogeneous multiprocessor platforms Sanjoy K. Baruah Shelby Funk The University of North Carolina Abstract In the partitioned approach to scheduling periodic tasks upon multiprocessors,
More information12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria
12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos
More informationMetode şi Algoritmi de Planificare (MAP) Curs 2 Introducere în problematica planificării
Metode şi Algoritmi de Planificare (MAP) 2009-2010 Curs 2 Introducere în problematica planificării 20.10.2009 Metode si Algoritmi de Planificare Curs 2 1 Introduction to scheduling Scheduling problem definition
More informationEssential facts about NP-completeness:
CMPSCI611: NP Completeness Lecture 17 Essential facts about NP-completeness: Any NP-complete problem can be solved by a simple, but exponentially slow algorithm. We don t have polynomial-time solutions
More informationOn-line Bin-Stretching. Yossi Azar y Oded Regev z. Abstract. We are given a sequence of items that can be packed into m unit size bins.
On-line Bin-Stretching Yossi Azar y Oded Regev z Abstract We are given a sequence of items that can be packed into m unit size bins. In the classical bin packing problem we x the size of the bins and try
More informationSPEED SCALING FOR ENERGY AWARE PROCESSOR SCHEDULING: ALGORITHMS AND ANALYSIS
SPEED SCALING FOR ENERGY AWARE PROCESSOR SCHEDULING: ALGORITHMS AND ANALYSIS by Daniel Cole B.S. in Computer Science and Engineering, The Ohio State University, 2003 Submitted to the Graduate Faculty of
More informationMachine scheduling with resource dependent processing times
Mathematical Programming manuscript No. (will be inserted by the editor) Alexander Grigoriev Maxim Sviridenko Marc Uetz Machine scheduling with resource dependent processing times Received: date / Revised
More informationNon-Work-Conserving Non-Preemptive Scheduling: Motivations, Challenges, and Potential Solutions
Non-Work-Conserving Non-Preemptive Scheduling: Motivations, Challenges, and Potential Solutions Mitra Nasri Chair of Real-time Systems, Technische Universität Kaiserslautern, Germany nasri@eit.uni-kl.de
More informationDominating Set. Chapter 7
Chapter 7 Dominating Set In this chapter we present another randomized algorithm that demonstrates the power of randomization to break symmetries. We study the problem of finding a small dominating set
More informationNP-COMPLETE PROBLEMS. 1. Characterizing NP. Proof
T-79.5103 / Autumn 2006 NP-complete problems 1 NP-COMPLETE PROBLEMS Characterizing NP Variants of satisfiability Graph-theoretic problems Coloring problems Sets and numbers Pseudopolynomial algorithms
More informationA Robust APTAS for the Classical Bin Packing Problem
A Robust APTAS for the Classical Bin Packing Problem Leah Epstein 1 and Asaf Levin 2 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. Email: lea@math.haifa.ac.il 2 Department of Statistics,
More informationDispersing Points on Intervals
Dispersing Points on Intervals Shimin Li 1 and Haitao Wang 1 Department of Computer Science, Utah State University, Logan, UT 843, USA shiminli@aggiemail.usu.edu Department of Computer Science, Utah State
More informationNP Completeness and Approximation Algorithms
Chapter 10 NP Completeness and Approximation Algorithms Let C() be a class of problems defined by some property. We are interested in characterizing the hardest problems in the class, so that if we can
More informationHow to deal with uncertainties and dynamicity?
How to deal with uncertainties and dynamicity? http://graal.ens-lyon.fr/ lmarchal/scheduling/ 19 novembre 2012 1/ 37 Outline 1 Sensitivity and Robustness 2 Analyzing the sensitivity : the case of Backfilling
More informationAn Effective Chromosome Representation for Evolving Flexible Job Shop Schedules
An Effective Chromosome Representation for Evolving Flexible Job Shop Schedules Joc Cing Tay and Djoko Wibowo Intelligent Systems Lab Nanyang Technological University asjctay@ntuedusg Abstract As the Flexible
More informationFINAL EXAM PRACTICE PROBLEMS CMSC 451 (Spring 2016)
FINAL EXAM PRACTICE PROBLEMS CMSC 451 (Spring 2016) The final exam will be on Thursday, May 12, from 8:00 10:00 am, at our regular class location (CSI 2117). It will be closed-book and closed-notes, except
More informationNotes on MapReduce Algorithms
Notes on MapReduce Algorithms Barna Saha 1 Finding Minimum Spanning Tree of a Dense Graph in MapReduce We are given a graph G = (V, E) on V = N vertices and E = m N 1+c edges for some constant c > 0. Our
More informationAnalysis of Algorithms. Unit 5 - Intractable Problems
Analysis of Algorithms Unit 5 - Intractable Problems 1 Intractable Problems Tractable Problems vs. Intractable Problems Polynomial Problems NP Problems NP Complete and NP Hard Problems 2 In this unit we
More informationA strongly polynomial algorithm for linear systems having a binary solution
A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th
More informationEfficient Reassembling of Graphs, Part 1: The Linear Case
Efficient Reassembling of Graphs, Part 1: The Linear Case Assaf Kfoury Boston University Saber Mirzaei Boston University Abstract The reassembling of a simple connected graph G = (V, E) is an abstraction
More informationOn the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems
MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage
More informationWe are going to discuss what it means for a sequence to converge in three stages: First, we define what it means for a sequence to converge to zero
Chapter Limits of Sequences Calculus Student: lim s n = 0 means the s n are getting closer and closer to zero but never gets there. Instructor: ARGHHHHH! Exercise. Think of a better response for the instructor.
More informationComputational complexity theory
Computational complexity theory Introduction to computational complexity theory Complexity (computability) theory deals with two aspects: Algorithm s complexity. Problem s complexity. References S. Cook,
More informationComplexity theory for fellow CS students
This document contains some basics of the complexity theory. It is mostly based on the lecture course delivered at CS dept. by Meran G. Furugyan. Differences are subtle. Disclaimer of warranties apply:
More informationAlgorithms. NP -Complete Problems. Dong Kyue Kim Hanyang University
Algorithms NP -Complete Problems Dong Kyue Kim Hanyang University dqkim@hanyang.ac.kr The Class P Definition 13.2 Polynomially bounded An algorithm is said to be polynomially bounded if its worst-case
More informationCS 374: Algorithms & Models of Computation, Spring 2017 Greedy Algorithms Lecture 19 April 4, 2017 Chandra Chekuri (UIUC) CS374 1 Spring / 1
CS 374: Algorithms & Models of Computation, Spring 2017 Greedy Algorithms Lecture 19 April 4, 2017 Chandra Chekuri (UIUC) CS374 1 Spring 2017 1 / 1 Part I Greedy Algorithms: Tools and Techniques Chandra
More informationComplexity Theory Part II
Complexity Theory Part II Time Complexity The time complexity of a TM M is a function denoting the worst-case number of steps M takes on any input of length n. By convention, n denotes the length of the
More informationScheduling Parallel Tasks: Approximation Algorithms
Scheduling Parallel Tasks: Approximation Algorithms Pierre-Francois Dutot, Grégory Mounié, Denis Trystram To cite this version: Pierre-Francois Dutot, Grégory Mounié, Denis Trystram. Scheduling Parallel
More informationOptimal on-line algorithms for single-machine scheduling
Optimal on-line algorithms for single-machine scheduling J.A. Hoogeveen A.P.A. Vestjens Department of Mathematics and Computing Science, Eindhoven University of Technology, P.O.Box 513, 5600 MB, Eindhoven,
More informationResilient and energy-aware algorithms
Resilient and energy-aware algorithms Anne Benoit ENS Lyon Anne.Benoit@ens-lyon.fr http://graal.ens-lyon.fr/~abenoit CR02-2016/2017 Anne.Benoit@ens-lyon.fr CR02 Resilient and energy-aware algorithms 1/
More information