Adaptive Checkpointing in Dynamic Grids for Uncertain Job Durations

Maria Chtepen, Bart Dhoedt, Filip De Turck, Piet Demeester
INTEC-IBBT, Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium
{maria.chtepen, bart.dhoedt, fillip.deturck, piet.demeester}@intec.ugent.be

Filip H.A. Claeys
MOSTforWATER NV, Koning Leopold III-laan 2, Kortrijk, Belgium
fc@mostforwater.com

Peter A. Vanrolleghem
modelEAU, Université Laval, Québec, Qc, G1K 7P4, Canada
peter.vanrolleghem@gci.ulaval.ca

Abstract. Adaptive checkpointing is a relatively new approach that is particularly suitable for providing fault-tolerance in dynamic and unstable grid environments. The approach allows for periodic modification of checkpointing intervals at run-time, when additional information becomes available. In this paper an adaptive algorithm, named MeanFailureCP+, is introduced that deals with checkpointing of grid applications with execution times that are unknown a priori. The algorithm modifies its parameters based on dynamically collected feedback on its performance. Simulation results show that the new algorithm performs even better than adaptive approaches that make use of exact information on job execution times.

Keywords. Grid computing, fault-tolerance, adaptive checkpointing.

1. Introduction

Fault-tolerance is an important issue in the domain of grid computing, since grids are composed of highly distributed, decentrally managed and thus potentially unreliable resources. Application (job) checkpointing is a technique that is commonly applied to provide fault-tolerance in grids. The efficiency of this technique strongly depends on a good choice of checkpointing interval: an overly short checkpointing interval leads to a large number of redundant checkpoints, which delay job processing by consuming computational and network resources; on the other hand, when a checkpointing interval is too long, a substantial amount of work has to be redone in case of a resource failure.
The optimal length of a checkpointing interval is, however, extremely hard to determine before run-time, when no exact knowledge of job and grid parameters is available (job execution time, resource failure pattern, etc.). Furthermore, grid parameters characteristically change over time, which implies that even if an appropriate checkpointing interval is initially chosen, the performance of a static checkpointing algorithm that relies on this choice will degrade over time. To deal with this issue, research in the checkpointing area has recently turned its attention to adaptive checkpointing solutions [2-7]. The latter allow for dynamic modification of an initial checkpointing interval as more information becomes available on grid workload and system parameters.

In this paper a new adaptive checkpointing approach, named MeanFailureCP+, is introduced. MeanFailureCP+ is designed to operate in the absence of exact information on job length. The algorithm avoids unnecessary checkpointing by modifying its internal parameters as a function of dynamically collected feedback on the system performance. We compare the performance of the new algorithm against that of the adaptive solution introduced in our previous work (MeanFailureCP) [1]. MeanFailureCP is designed to modify (increase or decrease) a job checkpointing interval as a function of the mean failure frequency of the resources where the job is being executed and the total job execution time. The main disadvantage of this algorithm is that it relies on the assumption that the exact job length can be provided in advance, while for most existing real-world applications this cannot be taken for granted. Furthermore, our recent research has shown that MeanFailureCP, while significantly outperforming periodic checkpointing, still introduces a considerable amount of redundant state savings. MeanFailureCP+, on the other hand, not only weakens the requirement for the exact job duration to be known in advance, but also further reduces the checkpointing overhead.

This paper is organized as follows: Section 2 gives an overview of related work; Section 3 summarizes the operation of MeanFailureCP; in Section 4 the MeanFailureCP performance is evaluated; Section 5 discusses MeanFailureCP+; MeanFailureCP+ is evaluated in Section 6; and, finally, Section 7 concludes the paper.

2. Related Work

In [7] an on-line checkpointing algorithm is proposed that can be seen as a predecessor of modern adaptive solutions. The algorithm uses on-line knowledge of the current cost of a checkpoint when it decides whether or not checkpointing has to be performed. The main idea behind the algorithm is to look for points in an application at which its state size is small and at which placing a checkpoint is most beneficial. At these points checkpointing is performed frequently, while at points with high cost, long checkpointing intervals are used. An obvious disadvantage of this approach is that it does not take into account the resource failure pattern. In [4] and [5] the so-called cooperative checkpointing concept is introduced, which addresses system performance and robustness issues by allowing the application programmer, the compiler and the run-time system to jointly decide on the necessity of each checkpoint. The algorithm proposed in our paper is also based on this concept and thus can be seen as a cooperative (adaptive) heuristic. In [6] adaptive checkpointing is applied for fault detection and recovery. Overhead is reduced by differentiating the frequencies of occurrence of store checkpoints (SCPs) and compare checkpoints (CCPs).
The disadvantage of this scheme is that it requires accurate information on the remaining job execution time and the expected remaining number of failures before job termination. [2], in turn, considers only dynamic checkpointing interval reduction in case it leads to computational gain, which is quantified by the sum of the differences between the means for fault-affected and fault-unaffected job response times. In [3] yet another adaptive fault management scheme (FT-Pro) is discussed. FT-Pro combines adaptive checkpointing with proactive process migration. The approach optimizes application execution time by considering the failure impact and the prevention costs. FT-Pro supports three prevention actions: skip checkpoint, take checkpoint and migrate. An adaptation manager selects an appropriate action in response to failure prediction. The effectiveness of FT-Pro strongly depends on the quality of this prediction.

3. MeanFailureCP

Figure 1. Operation of MeanFailureCP on a resource running a single job

MeanFailureCP is an adaptive algorithm that dynamically modifies the initially specified checkpointing interval to optimize the number of checkpoints taken and thus to reduce the computational overhead. The size of the adopted checkpointing interval ($I_r^j$) is determined by the currently remaining job execution time ($RE_r^j$) and the average failure interval ($MF_r$) of the resource $r$ where the job $j$ is assigned. Operation of the algorithm is visualized in Fig. 1. MeanFailureCP is first activated after a short time period $t_i$ (defined by the end-user) after the beginning of job execution (Step 1). Early activation of the algorithm opens the possibility to modify the checkpointing interval at an early stage of job processing.
In each iteration the algorithm checkpoints the job state and determines the timestamp for the next checkpointing event as follows:

If $RE_r^j < MF_r$ and $I_r^j < \alpha \cdot E_r^j$, where $\alpha$ is a user-specified parameter and $E_r^j$ is the total execution time of the job $j$ on the resource $r$: the checkpointing interval is increased, $I_{new} = I_{old} + I$, where $I$ is the length of the initial checkpointing interval provided by the end-user (Step 2). The first condition leads to a reduction of checkpointing overhead for sufficiently stable resources or almost finished jobs. The second condition prevents excessive growth of $I_r^j$ compared to the job length.

If $RE_r^j > MF_r$ or $I_r^j \geq \alpha \cdot E_r^j$: the checkpointing interval is decreased, $I_{new} = I_{old} - I$ (Step 3). When reducing the checkpointing interval, the following constraint should be taken into account: $C < \beta \cdot E_r^j \leq I_{new}$, where $\beta < 1$ is a user-defined value. The constraint ensures that the time interval between consecutive checkpoints never decreases below the time overhead added to the job execution time by each checkpoint ($C$). Experiments have shown that, to prevent undesirably steep decreases of the checkpointing interval, the value assigned to $\beta$ should be at least 0.01, or 1% of the job length.

Finally, modifying the value of $I_r^j$ in steps of $I$ ensures fast achievement of a (sub)optimal checkpointing frequency in most distributed environments.

4. Performance Evaluation of MeanFailureCP

Figure 2. Probability density function of job length distribution for example values of deviation (Dev) parameter (curves for Dev = 10, 100 and 1,000; x-axis: job length (min); y-axis: probability (%))

MeanFailureCP assumes that the exact job length is known beforehand. However, there are two problems with this assumption. First of all, it seems to be inapplicable to a large group of real-world applications, for which only a very rough estimate of the total job length can be provided in advance. Secondly, recent simulation experiments show that knowledge of the exact job length does not necessarily lead to better algorithm performance. The latter is a consequence of the fact that MeanFailureCP does not generate the optimal number of checkpoints, which leads to some redundancy. By carefully calibrating the algorithm's parameters, this redundancy can be eliminated to a large extent. As opposed to [1], in this section we evaluate the influence of the quality of job length estimates on the performance of MeanFailureCP.
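To make the interval-update rule of Section 3 concrete, the following is a minimal Python sketch of a single MeanFailureCP decision. It is an illustration under stated assumptions, not the authors' implementation: the function name `next_interval` and its argument names are hypothetical, all times are taken to be in seconds, the lower bound on the interval is enforced by clamping, and the case in which the remaining execution time equals the mean failure interval (which the description does not cover) is treated as a decrease.

```python
def next_interval(i_old, i_init, re, mf, e, c, alpha=2.0, beta=0.01):
    """Return the next checkpointing interval for one job on one resource.

    i_old  -- current checkpointing interval
    i_init -- initial interval provided by the end-user (the step size)
    re     -- remaining execution time of the job
    mf     -- average failure interval of the resource
    e      -- (estimated) total execution time of the job on the resource
    c      -- time overhead of a single checkpoint
    """
    floor = beta * e
    # The constraint requires the checkpoint overhead to stay below beta * E.
    if not c < floor:
        raise ValueError("checkpoint overhead must be smaller than beta * E")

    # Step 2: stable resource or almost finished job, and interval still
    # small compared to the job length -> checkpoint less often.
    if re < mf and i_old < alpha * e:
        return i_old + i_init

    # Step 3: otherwise checkpoint more often, but never let the interval
    # drop below beta * E (assumption: enforced here by clamping).
    return max(i_old - i_init, floor)
```

With a one-hour job, a 30-minute mean failure interval and a 5-minute initial interval, the interval grows while less than 30 minutes of work remain and shrinks (down to the 36 s floor given by beta = 0.01) otherwise.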
Using the discrete event grid simulation environment DSiDE [1], we model a heavily loaded dynamic grid consisting of 128 computational resources, equally spread over 4 globally distributed sites. Jobs submitted to the considered grid have a normally distributed length with an average of 1 hour and a standard deviation varying as shown in Fig. 2. The checkpointing overhead $C$ varies from 2 to 5 s, and the data size of each checkpoint, which is transferred over the network to a single checkpointing server, is 10 MB. $\alpha$ and $\beta$ are initialized to 2 and 0.01, respectively. Two simulation scenarios are considered: in the first scenario grid resources are assumed to be highly unstable (cf. desktop grid), with the dynamics of failure occurrence modeled by means of a Weibull distribution with the shape parameter $k = 1800$ (30 min) and the scale parameter $\lambda = 0.7$; in the second scenario failures happen less frequently ($k = 10800$, $\lambda = 0.7$), which means that jobs have a high probability of executing without being disturbed by a failure. For both scenarios we observe the performance of MeanFailureCP when $E_r^j$ is calculated using either the exact job length, the average length over all submitted jobs, or a certain deviation from this average.

Fig. 3 and Fig. 4 show, for the unstable grid, the number of successfully executed jobs and the average number of checkpoints saved per job, for varying probability density functions of the job length distribution. Fig. 5 and Fig. 6 show the same quantities for the second simulation scenario. The deviation from the average job length is depicted in the figures with + and - signs, where, for example, avg-30% means that the length of the submitted jobs was assumed to be the average over all jobs decreased by 30%. The simulation results show that MeanFailureCP does not necessarily perform better with the exact job length. For instance, in the case of highly unstable resources (see Fig. 3 and Fig. 4), there is a relatively large set of approximation values for which the algorithm performs just as well or even better.
In the example at hand, the system performance improves by 10% when the length of the submitted jobs is assumed to be twice as high as the average value. This can be explained by the fact that the assumed job length, in combination with the mean failure frequency of resources, further optimizes the number of checkpoints performed, compared to the exact algorithm. However, as can be seen in the figures, when the number of checkpoints taken keeps decreasing, the performance of MeanFailureCP degrades considerably. Simulation experiments have shown that there can be several (sub)optima, on the positive and on the negative side of the average; however, one of them, if any, always lies on the positive side. In general, a decrease in the assumed job length below the average value leads to a rapid decrease in the number of checkpoints, since in that case the condition $RE_r^j < MF_r$ almost always evaluates to true, which results in growth of the checkpointing interval. The reason for the behavior described above lies in the imposed limitations on the growth and decrease of checkpointing intervals. For instance, the constraint $\beta \cdot E_r^j \leq I_{new}$ ensures that even in the case of a much overestimated job length and frequent failures, which would normally lead to exaggerated checkpointing, the interval cannot drop below a fixed percentage of the predicted job length.

Figure 3. Average number of jobs executed by MeanFailureCP, with varying job length estimation, in an unstable grid

Figure 4. Average number of checkpoints saved per job by MeanFailureCP, with varying job length estimation, in an unstable grid

Figure 5. Average number of jobs executed by MeanFailureCP, with varying job length estimation, in a stable grid

Figure 6. Average number of checkpoints saved per job by MeanFailureCP, with varying job length estimation, in a stable grid

The above result suggests that, despite the fact that MeanFailureCP is more efficient and flexible than periodic checkpointing [1], in unstable grids the algorithm is still subject to further performance improvement. On the other hand, when the grid system is stable (see Fig. 5 and Fig. 6), MeanFailureCP performs more or less similarly for all considered values of the job length. This is the result of overall limited checkpointing, resulting from long failure-free intervals, and the reduced effect of failure on the system performance. Clearly, the optimal job length prediction depends on several parameters, such as the length of the failure-free interval, limits on the checkpointing interval, checkpointing overhead, etc. It is not only hard to collect a reliable estimate of these parameters beforehand, but the actual values of the considered parameters will presumably also change over time, which undermines the usability of static estimates. Therefore, in the following section we introduce MeanFailureCP+, which performs a dynamic search for the optimal job length estimate, using run-time information on the system performance.

5. MeanFailureCP+

A typical grid application generates batches of similar jobs, which are more or less simultaneously submitted to the grid for processing. Therefore, in contrast to MeanFailureCP, which considers individual jobs and requires their exact length to be known in advance, MeanFailureCP+ operates on job batches and needs only a rough initial job length estimate ($L_b$) to be provided by an end-user. Observation of real-world applications leads us to the conclusion that the spread of job lengths within a single batch can be approximated by a normal distribution. The average of this distribution can be derived from historical information on previous application runs and used to initialize $L_b$. To optimize the system throughput, MeanFailureCP+ dynamically monitors the number of jobs processed during a monitoring interval of predefined length $M_b$ and, based on this feedback, modifies subsequent job length estimates ($L_b$) in such a way that the checkpointing overhead is minimized without significantly penalizing the system fault-tolerance. The length of the interval $M_b$ should be chosen as a function of $L_b$. Similar to MeanFailureCP, MeanFailureCP+ is first activated after a short time interval $t_i$ after the beginning of the job execution and is afterwards called each time the checkpointing interval $I_r^j$ expires. The algorithm proceeds as follows:

If $T_c - T_m < M_b$, where $T_c$ is the current time and $T_m$ stands for the begin time of the last monitoring interval: MeanFailureCP is run with $E_r^j = L_b$, and the estimate $L_b$ is left unchanged.

If $T_c - T_m \geq M_b$ and $N_M < 2$, where $N_M$ is the number of monitoring intervals already elapsed: MeanFailureCP+ slightly increases the job length estimate by a small randomly chosen value, called the deviation value ($D$), which is in our case set to 0.1: $L_b = L_b + L_b \cdot D$. Afterwards, MeanFailureCP is executed with the new value $E_r^j = L_b$. The gradual increase of $L_b$ allows the algorithm to explore other estimates of the job length and to escape from a possible local maximum.
In the following phase the algorithm evaluates the effect of this slight increase in job length; at this point in its execution, however, there is still insufficient performance data collected to perform the evaluation ($N_M < 2$).

If $T_c - T_m \geq M_b$ and $N_M > 2$: the performance of the algorithm over the past two monitoring intervals is evaluated. Each time a job successfully terminates its execution, the job count of the current monitoring interval is incremented. In this phase, the algorithm compares the number of jobs executed during the last monitoring interval ($N_{JL}$) against the number of jobs executed during the last but one monitoring interval ($N_{JLBO}$). If $N_{JL} = N_{JLBO}$, the deviation value is again slightly incremented, $D = D + 0.1$, together with the estimated job length, $L_b = L_b + L_b \cdot D$. If $N_{JL} > N_{JLBO}$, recent changes have had a positive effect on the algorithm's performance. Therefore, we again increase the deviation percentage, $D = D + 0.1$. Afterwards, the new value of $D$ is compared against the performance increase $P = (N_{JL} - N_{JLBO}) / (N_{JLBO} \cdot 0.01)$. If $P > D$, $L_b$ is modified as follows: $L_b = L_b + L_b \cdot P$; otherwise $L_b = L_b + L_b \cdot D$. This operation ensures that the increase in the estimated job length is at least proportional to the achieved performance increase. Finally, if $N_{JL} < N_{JLBO}$, the current value of $L_b$ is too high and has to be reduced. The size of the reduction is chosen to be proportional to the decrease in performance: $L_b = L_b - L_b \cdot ((N_{JLBO} - N_{JL}) / (N_{JLBO} \cdot 0.01))$. Once optimal values of $L_b$ and $M_b$ are found for a particular application, they can be saved and used for subsequent application runs.

6. Performance Evaluation of MeanFailureCP+

We evaluate the performance of MeanFailureCP+ in the simulated grid environment described in Section 4. The initial job length estimate $L_b$ is set to 1 hour, the average length of all submitted jobs, and the monitoring interval $M_b$ is initialized in turn with 30 min, 1 hour and 2 hours. Fig. 7 and Fig. 8 show the simulation results for the two grid failure scenarios. For comparison, next to the number of jobs successfully processed by MeanFailureCP+, the figures depict the number of jobs processed by MeanFailureCP with the exact job lengths. The best result achieved in Section 4 (MFCP, avg+100%) is also presented in the figures. In the case of highly unstable grids, MeanFailureCP+ with the monitoring interval equal to the average job length leads to the best job throughput. However, MeanFailureCP+ with $M_b$ initialized to different values also performs better than MeanFailureCP. As can be expected, within a stable grid the benefit of MeanFailureCP+ is less significant.
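The feedback loop of Section 5, which adjusts the job length estimate at the end of each monitoring interval, can be sketched in Python as follows. This is a hedged illustration rather than the authors' code, and several points are assumptions: the names `PlusState`, `percent_change`, `grow` and `update_estimate` are hypothetical; the deviation value D and the performance change P are both read as percentages, so an update "L_b + L_b * x" is applied as x percent of L_b; the evaluation phase is entered as soon as two monitoring intervals have elapsed (the text states the conditions N_M < 2 and N_M > 2 and leaves N_M = 2 open); and a previous interval with zero completed jobs falls back to the deviation step.

```python
from dataclasses import dataclass


@dataclass
class PlusState:
    l_b: float          # job length estimate L_b (seconds)
    m_b: float          # monitoring interval length M_b (seconds)
    t_m: float = 0.0    # start time T_m of the running monitoring interval
    n_m: int = 0        # monitoring intervals elapsed so far (N_M)
    d: float = 0.1      # deviation value D, read here as a percentage
    n_jl: int = 0       # jobs completed in the last finished interval (N_JL)
    n_jlbo: int = 0     # jobs completed in the interval before it (N_JLBO)
    jobs_now: int = 0   # jobs completed so far in the running interval


def percent_change(diff: int, base: int) -> float:
    """diff / (base * 0.01): diff expressed as a percentage of base."""
    return diff / (base * 0.01)


def grow(l_b: float, percent: float) -> float:
    """Change l_b by the given percentage (assumed reading of L_b + L_b * x)."""
    return l_b + l_b * percent / 100.0


def update_estimate(s: PlusState, t_c: float) -> float:
    """Return the estimate L_b that MeanFailureCP should use as E at time t_c."""
    if t_c - s.t_m < s.m_b:
        return s.l_b                          # interval still running: keep L_b

    # A monitoring interval has ended: shift the job counters, start a new one.
    s.n_jlbo, s.n_jl, s.jobs_now = s.n_jl, s.jobs_now, 0
    s.t_m, s.n_m = t_c, s.n_m + 1

    if s.n_m < 2:
        s.l_b = grow(s.l_b, s.d)              # too little data: explore upward
    elif s.n_jl >= s.n_jlbo:
        s.d += 0.1                            # throughput equal or better
        step = s.d
        if s.n_jl > s.n_jlbo and s.n_jlbo > 0:
            # Use the performance increase P when it exceeds D.
            step = max(step, percent_change(s.n_jl - s.n_jlbo, s.n_jlbo))
        s.l_b = grow(s.l_b, step)
    else:
        # Throughput dropped: reduce L_b in proportion to the decrease.
        s.l_b = grow(s.l_b, -percent_change(s.n_jlbo - s.n_jl, s.n_jlbo))
    return s.l_b
```

A caller would increment `jobs_now` whenever a job of the batch terminates successfully and pass the returned estimate to MeanFailureCP as the total execution time each time a checkpointing interval expires.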

Figure 7. Number of jobs successfully executed by MeanFailureCP+, with varying monitoring interval, in an unstable grid (series: MFCP+(0.5h), MFCP+(1h), MFCP+(2h))

Figure 8. Number of jobs successfully executed by MeanFailureCP+, with varying monitoring interval, in a stable grid (series: MFCP+(0.5h), MFCP+(1h), MFCP+(2h))

7. Conclusion

Adaptive job checkpointing is a highly suitable technique to provide fault-tolerance in heterogeneous and decentrally managed grids. Its main advantage is that it allows for dynamic modification of checkpointing intervals as a function of application and system parameters collected at run-time. This paper introduces an adaptive checkpointing algorithm named MeanFailureCP+ that operates in the absence of information on the total job duration. The algorithm initially requires only a rough estimate of the job length, which is modified at run-time, based on dynamically collected information on the algorithm's performance. We compare the performance of this new feedback-based approach against the performance of an adaptive checkpointing algorithm named MeanFailureCP. MeanFailureCP determines an appropriate checkpointing frequency based on job execution time and resource failure frequency. MeanFailureCP requires, however, that the exact job length is known before run-time. In this paper we show that knowledge of the exact job length is an overly strict requirement, which does not necessarily lead to optimal algorithm performance. Simulation results show that MeanFailureCP+, without a priori knowledge of the exact job length, increases grid throughput by up to 10% compared to the throughput of MeanFailureCP initialized with exact values.

8. References

[1] Chtepen M, Claeys F.H.A, Dhoedt B, De Turck F, Demeester P, Vanrolleghem P.A. Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids. IEEE Transactions on Parallel and Distributed Systems 2009; 20(2): 180-190.
[2] Katsaros P, Angelis L, Lazos C. Performance and Effectiveness Trade-off for Checkpointing in Fault-Tolerant Distributed Systems. Concurrency and Computation: Practice & Experience 2007; 19(1): 37-63.
[3] Lan Z, Li Y. Adaptive Fault Management of Parallel Applications for High-Performance Computing. IEEE Transactions on Computers 2008; 57(12): 1647-1660.
[4] Oliner A, Rudolph L, Sahoo R. Cooperative Checkpointing: a Robust Approach to Large-Scale Systems Reliability. In: Proceedings of the 20th Annual International Conference on Supercomputing; 2006 June 28 - Jul 1; Cairns, Queensland, Australia.
[5] Oliner A, Sahoo R. Evaluating Cooperative Checkpointing for Supercomputing Systems. In: Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 06); 2006 Apr 25-29; Rhodes Island, Greece.
[6] Xiang Y, Li Z, Chen H. Optimizing Adaptive Checkpointing Schemes for Grid Workflow Systems. In: Proceedings of the 5th International Conference on Grid and Cooperative Computing (GCC 06); 2006 Oct 21-23; Changsha, Hunan, China.
[7] Ziv A, Bruck J. An On-Line Algorithm for Checkpoint Placement. IEEE Transactions on Computers 1997; 46(9): 976-985.