Adaptive Checkpointing in Dynamic Grids for Uncertain Job Durations

Maria Chtepen, Bart Dhoedt, Filip De Turck, Piet Demeester
INTEC-IBBT, Ghent University, Sint-Pietersnieuwstraat 41, Ghent, Belgium
{maria.chtepen, bart.dhoedt, fillip.deturck, piet.demeester}@intec.ugent.be

Filip H.A. Claeys
MOSTforWATER NV, Koning Leopold III-laan 2, Kortrijk, Belgium
fc@mostforwater.com

Peter A. Vanrolleghem
modelEAU, Université Laval, Québec, Qc, G1K 7P4, Canada
peter.vanrolleghem@gci.ulaval.ca

Abstract. Adaptive checkpointing is a relatively new approach that is particularly suitable for providing fault-tolerance in dynamic and unstable grid environments. The approach allows for periodic modification of checkpointing intervals at run-time, when additional information becomes available. In this paper an adaptive algorithm, named MeanFailureCP+, is introduced that deals with checkpointing of grid applications with execution times that are unknown a priori. The algorithm modifies its parameters, based on dynamically collected feedback on its performance. Simulation results show that the new algorithm performs even better than adaptive approaches that make use of exact information on job execution times.

Keywords. Grid computing, fault-tolerance, adaptive checkpointing.

1. Introduction

Fault-tolerance is an important issue in the domain of grid computing, since grids are composed of highly distributed, decentrally managed and thus potentially unreliable resources. Application (job) checkpointing is a technique that is commonly applied to provide fault-tolerance in grids. The efficiency of this technique strongly depends on a good choice of a checkpointing interval: an overly short checkpointing interval leads to a large number of redundant checkpoints, which delay job processing by consuming computational and network resources; on the other hand, when a checkpointing interval is too long, a substantial amount of work has to be redone in case of a resource failure.
The optimal length of a checkpointing interval is, however, extremely hard to determine before run-time, when no exact knowledge on job and grid parameters is available (job execution time, resource failure pattern, etc.). Furthermore, grid parameters characteristically change over time, which implies that even if an appropriate checkpointing interval is initially chosen, the performance of a static checkpointing algorithm that relies on this choice will degrade over time. To deal with this issue, research in the checkpointing area has recently turned its attention to adaptive checkpointing solutions [2-7]. The latter allow for dynamic modifications of an initial checkpointing interval, as more information becomes available on grid workload and system parameters.

In this paper a new adaptive checkpointing approach, named MeanFailureCP+, is introduced. MeanFailureCP+ is designed to operate in absence of exact information on job length. The algorithm avoids unnecessary checkpointing by modifying its internal parameters as a function of dynamically collected feedback on the system performance. We compare the performance of the new algorithm against that of the adaptive solution introduced in our previous work (MeanFailureCP) [1]. MeanFailureCP is designed to modify (increase or decrease) a job checkpointing interval as a function of the mean failure frequency of the resources where the job is being executed, and the total job execution time. The main disadvantage of this algorithm is that it relies on the assumption that the exact job length can be provided in advance, while for most existing real-world applications this cannot be taken for granted. Furthermore, our recent research has shown that MeanFailureCP, while significantly outperforming periodic checkpointing, still introduces a considerable amount of redundant state savings. MeanFailureCP+, on the other hand, not only weakens the requirement for the exact job duration to be known in advance, but also further reduces the checkpointing overhead.

This paper is organized as follows: Section 2 gives an overview of related work; Section 3 summarizes the operation of MeanFailureCP; in Section 4 the MeanFailureCP performance is evaluated; Section 5 discusses MeanFailureCP+; MeanFailureCP+ is evaluated in Section 6; and, finally, Section 7 concludes the paper.

2. Related Work

In [7] an on-line checkpointing algorithm is proposed that can be seen as a predecessor of modern adaptive solutions. The algorithm uses on-line knowledge of the current cost of a checkpoint when it decides whether or not checkpointing has to be performed. The main idea behind the algorithm is to look for points in an application in which its state size is small and in which placing a checkpoint is the most beneficial. At these points checkpointing is performed frequently, while at points with high cost, long checkpointing intervals are used. An obvious disadvantage of this approach is that it does not take into account the resource failure pattern.

In [4] and [5] the so-called cooperative checkpointing concept is introduced, which addresses system performance and robustness issues by allowing the application programmer, the compiler and the run-time system to jointly decide on the necessity of each checkpoint. The algorithm proposed in our paper is also based on this concept and thus can be seen as a cooperative (adaptive) heuristic.

In [6] adaptive checkpointing is applied for fault detection and recovery. Overhead is reduced by differentiating frequencies of occurrence of store checkpoints (SCPs) and compare checkpoints (CCPs).
The disadvantage of this scheme is that it requires accurate information on remaining job execution time and the expected remaining number of failures before job termination. The approach in [2], in turn, considers only dynamic checkpointing interval reduction in case it leads to computational gain, which is quantified by the sum of the differences between the means for fault-affected and fault-unaffected job response times.

In [3] yet another adaptive fault management scheme (FT-Pro) is discussed. FT-Pro combines adaptive checkpointing with proactive process migration. The approach optimizes application execution time by considering the failure impact and the prevention costs. FT-Pro supports three prevention actions: skip checkpoint, take checkpoint and migrate. An adaptation manager selects an appropriate action in response to failure prediction. The effectiveness of FT-Pro strongly depends on the quality of this prediction.

3. MeanFailureCP

[Figure 1. Operation of MeanFailureCP on a resource running a single job.]

MeanFailureCP is an adaptive algorithm that dynamically modifies the initially specified checkpointing interval to optimize the number of checkpoints taken and thus to reduce the computational overhead. The size of the adopted checkpointing interval ($I_j^r$) is determined by the currently remaining job execution time ($RE_j^r$) and the average failure interval ($MF_r$) of the resource $r$ where the job $j$ is assigned. Operation of the algorithm is visualized in Fig. 1.

MeanFailureCP is first activated after a short time period $t_i$ (defined by the end-user) after the beginning of job execution (Step 1). Early activation of the algorithm opens the possibility to modify the checkpointing interval at an early stage of job processing.
In each iteration the algorithm checkpoints the job state and determines the timestamp for the next checkpointing event as follows:

If $RE_j^r < MF_r$ and $I_j^r < \alpha \cdot E_j^r$, where $I_j^r$ is the current checkpointing interval, $\alpha$ is a user-specified parameter and $E_j^r$ is the total execution time of the job $j$ on the resource $r$: the checkpointing interval is increased, $I_{new} = I_{old} + I$, where $I$ is the length of the initial checkpointing interval provided by the end-user (Step 2). The first condition leads to reduction of checkpointing overhead for sufficiently stable resources or almost finished jobs. The second condition prevents excessive growth of $I_j^r$, compared to the job length.

If $RE_j^r > MF_r$ or $I_j^r \geq \alpha \cdot E_j^r$: the checkpointing interval is decreased, $I_{new} = I_{old} - I$ (Step 3). When reducing the checkpointing interval, the following constraint should be taken into account: $C < \beta \cdot E_j^r < I_{new}$, where $\beta < 1$ is a user-defined value that ensures that the time interval between consecutive checkpoints never decreases below the time overhead added to a job execution time by each checkpoint ($C$). Experiments have shown that to prevent undesirably steep decreases of the checkpointing interval, the value assigned to $\beta$ should be at least 0.01, or 1% of a job length.

Finally, modifying values of $I_j^r$ by $I$ ensures fast achievement of a (sub)optimal checkpointing frequency in most distributed environments.

4. Performance Evaluation of MeanFailureCP

[Figure 2. Probability density function of the job length distribution for example values of the deviation (Dev) parameter (Dev = 10, Dev = 100, Dev = 1,000). Horizontal axis: job length (min); vertical axis: probability (%).]

MeanFailureCP assumes that the exact job length is known beforehand. However, there are two problems with this assumption. First of all, it seems to be inapplicable for a large group of real-world applications, for which only a very rough estimation of the total job length can be provided in advance. Secondly, recent simulation experiments show that knowledge of the exact job length does not necessarily lead to better performance of the algorithm. The latter is a consequence of the fact that MeanFailureCP does not generate the optimal number of checkpoints, which leads to some redundancy. By carefully calibrating the algorithm's parameters, this redundancy can be eliminated to a large extent. As opposed to [1], in this section we evaluate the influence of the quality of job length estimates on the performance of MeanFailureCP.
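As a recap of Steps 2 and 3 of Section 3, the interval update can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function and argument names are hypothetical, all times are assumed to be in seconds, and the choice to leave the interval unchanged when a decrease would violate the lower bound is an assumption, since the text does not say what happens in that case.

```python
def next_checkpoint_interval(current, initial, remaining, mean_failure,
                             total, cost, alpha=2.0, beta=0.01):
    """Return the next checkpointing interval; all times in seconds.

    current      -- interval currently in use (I_old)
    initial      -- user-provided initial interval, used as step size (I)
    remaining    -- remaining job execution time (RE)
    mean_failure -- average failure interval of the resource (MF)
    total        -- total (estimated) job execution time (E)
    cost         -- time overhead of a single checkpoint (C)
    """
    if remaining < mean_failure and current < alpha * total:
        # Step 2: stable resource or nearly finished job, checkpoint less often.
        return current + initial

    # Step 3: checkpoint more often, but respect C < beta * E < I_new.
    candidate = current - initial
    lower_bound = max(cost, beta * total)
    if candidate > lower_bound:
        return candidate
    # Assumption: if the decrease would violate the bound, keep the interval.
    return current
```

For a one-hour job with alpha = 2 and beta = 0.01 (the values used in the simulations), the interval can grow up to about two hours and cannot shrink below 36 s.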
Using the discrete event grid simulation environment DSiDE [1], we model a heavily loaded dynamic grid consisting of 128 computational resources, equally spread over 4 globally distributed sites. Jobs submitted to the considered grid have a normally distributed length with an average of 1 hour and a standard deviation varying as shown in Fig. 2. The checkpointing overhead $C$ varies from 2 to 5 s and the data size of each checkpoint, which is transferred over the network to a single checkpointing server, is 10 MB. $\alpha$ and $\beta$ are initialized with 2 and 0.01, respectively. Two simulation scenarios are considered: in the first scenario grid resources are assumed to be highly unstable (cf. desktop grids), with the dynamics of the failure occurrence modeled by means of a Weibull distribution with the scale parameter $\lambda = 1800$ s (30 min) and the shape parameter $k = 0.7$; in the second scenario failures happen less frequently ($\lambda = 10800$ s, $k = 0.7$), which means that jobs have a high probability of executing without being disturbed by a failure.

For both scenarios we observe the performance of MeanFailureCP when $E_j^r$ is calculated using either the exact job length, the average length over all submitted jobs, or a certain deviation from this average. Fig. 3 and Fig. 4 show, for the unstable grid, the number of successfully executed jobs and the average number of checkpoints saved per job, for varying probability density functions of the job length distribution. Fig. 5 and Fig. 6 show the same parameters for the second simulation scenario. The deviation from the average job length is depicted in the figures with + and - signs, where, for example, avg-30% means that the length of the submitted jobs was assumed to be the average over all jobs decreased by 30%.

The simulation results show that MeanFailureCP does not necessarily perform better for the exact job length. For instance, in case of highly unstable resources (see Fig. 3 and Fig. 4), there is a relatively large set of approximation values for which the algorithm performs just as well or even better.
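To give a feel for the two failure scenarios, the sketch below draws failure-free intervals from a Weibull distribution with Python's standard library. It is only an illustration and is unrelated to how DSiDE is implemented; the function names are hypothetical, and it assumes that the 1800 s and 10800 s values act as the Weibull scale and 0.7 as the shape, which is the reading under which the 30 min figure is a time.

```python
import math
import random


def weibull_mean(scale, shape):
    """Analytic mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def sample_failure_intervals(n, scale, shape, seed=0):
    """Draw n failure-free interval lengths (seconds) for one resource."""
    rng = random.Random(seed)
    # random.weibullvariate takes the scale first and the shape second.
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


if __name__ == "__main__":
    for label, scale in (("unstable", 1800.0), ("stable", 10800.0)):
        mean_min = weibull_mean(scale, 0.7) / 60.0
        print(f"{label}: mean failure-free interval of about {mean_min:.0f} min")
```

With a shape below 1 the distribution is heavy-tailed, so the mean failure-free interval is noticeably longer than the scale value itself, while many individual intervals are much shorter.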
In the example at hand, the system performance improves by 10% when the length of the submitted jobs is assumed to be twice as high as the average value. This can be explained by the fact that the assumed job length in combination with the mean failure frequency of resources further optimizes the number of checkpoints performed, compared to the exact algorithm. However, as can be seen in the figures, when the number of checkpoints taken keeps reducing, the performance of MeanFailureCP considerably degrades. Simulation experiments have shown that there can be several (sub)optima, on the positive and on the negative side of the average; however, one of them, if any, always lies on the positive side. In general, a decrease in the assumed job length below the average value leads to a rapid decrease in the number of checkpoints, since in that case the condition $RE_j^r < MF_r$ almost always evaluates to true, which results in the growth of the checkpointing interval. The reason for the above described behavior of the algorithm lies in the imposed limitations on the growth/decrease of checkpointing intervals. For instance, the constraint $\beta \cdot E_j^r < I_{new}$ ensures that even in case of a much overestimated job length and frequent failures, which would normally lead to exaggerated checkpointing, the interval is limited to a percentage of the predicted job length.

The above result suggests that despite the fact that MeanFailureCP is more efficient and flexible than periodic checkpointing [1], in unstable grids, the algorithm is still subject to further performance improvement.

On the other hand, when the grid system is stable (see Fig. 5 and Fig. 6) MeanFailureCP performs more or less similarly for all considered values of the job length. This is the result of overall limited checkpointing, resulting from long failure-free intervals, and the reduced effect of failure on the system performance.

Clearly, the optimal job length prediction depends on several parameters, such as the length of the failure-free interval, limits on the checkpointing interval, checkpointing overhead, etc. Not only is it hard to collect a reliable estimation for these parameters beforehand, but the actual values of the considered parameters will presumably also change over time, which undermines the usability of static estimates. Therefore, in the following section we introduce MeanFailureCP+, which performs a dynamic search for the optimal job length estimation, using run-time information on the system performance.

[Figure 3. Average number of jobs executed by MeanFailureCP, with varying job length estimation, in an unstable grid. Vertical axis: # jobs; horizontal axis: standard deviation; reference series: MFCP(average).]

[Figure 4. Average number of checkpoints saved per job by MeanFailureCP, with varying job length estimation, in an unstable grid. Vertical axis: # checkpoints; horizontal axis: standard deviation; reference series: MFCP(average).]

[Figure 5. Average number of jobs executed by MeanFailureCP, with varying job length estimation, in a stable grid. Vertical axis: # jobs; horizontal axis: standard deviation; reference series: MFCP(average).]

[Figure 6. Average number of checkpoints saved per job by MeanFailureCP, with varying job length estimation, in a stable grid. Vertical axis: # checkpoints; horizontal axis: standard deviation; reference series: MFCP(average).]

5. MeanFailureCP+

A typical grid application generates batches of similar jobs, which are more or less simultaneously submitted to the grid for processing. Therefore, as opposed to MeanFailureCP, which regards individual jobs and requires their exact length to be known in advance, MeanFailureCP+ operates on job batches and needs only a rough initial job length estimation ($L_b$) to be provided by an end-user. Observation of real-world applications leads us to the conclusion that the spread of job lengths within a single batch can be approximated by a normal distribution. The average of this distribution can be derived from historical information on previous application runs and utilized for initialization of $L_b$.

To optimize the system throughput, MeanFailureCP+ dynamically monitors the number of jobs processed during a monitoring interval of predefined length $M_b$ and, based on this feedback, modifies subsequent job length estimates ($L_b$) in such a way that the checkpointing overhead is minimized without significantly penalizing the system fault-tolerance. The length of the interval $M_b$ should be chosen as a function of $L_b$.

Similar to MeanFailureCP, MeanFailureCP+ is first activated after a short time interval $t_i$ after the beginning of the job execution and is afterwards called each time the checkpointing interval $I_j^r$ expires. The algorithm proceeds as follows:

If $T_c - T_m < M_b$, where $T_c$ is the current time and $T_m$ stands for the begin time of the last monitoring interval: MeanFailureCP is run with $E_j^r = L_b$ (the estimate $L_b$ is left unchanged).

If $T_c - T_m \geq M_b$ and $N_M < 2$, where $N_M$ is the number of monitoring intervals already elapsed: MeanFailureCP+ slightly increases the job length estimation with a small randomly chosen value, called deviation value ($D$), which is in our case set to 0.1: $L_b = L_b + L_b \cdot D$. Afterwards, MeanFailureCP is executed with the new value $E_j^r = L_b$. The gradual increase of $L_b$ allows the algorithm to explore other estimations of job length and to escape from an eventual local maximum.
In the following phase the algorithm evaluates the effect of this slight increase in job length; however, at this point in its execution there is still insufficient performance data collected to perform the evaluation ($N_M < 2$).

If $T_c - T_m \geq M_b$ and $N_M \geq 2$: the performance of the algorithm over the past two monitoring intervals is evaluated. Each time a job successfully terminates its execution, the job count of the current monitoring interval is incremented. In this phase, the algorithm compares the number of jobs executed during the last monitoring interval ($N_{JL}$) against the number of jobs executed during the last but one monitoring interval ($N_{JLBO}$).

If $N_{JL} = N_{JLBO}$, the deviation value is again slightly incremented, $D = D + 0.1$, together with the estimated job length, $L_b = L_b + L_b \cdot D$.

If $N_{JL} > N_{JLBO}$, recent changes have had a positive effect on the algorithm's performance. Therefore, we again increase the deviation percentage, $D = D + 0.1$. Afterwards, the new value of $D$ is compared against the performance increase $P = (N_{JL} - N_{JLBO}) / (N_{JLBO} \cdot 0.01)$. If $P > D$, $L_b$ is modified as $L_b = L_b + L_b \cdot P$; otherwise $L_b = L_b + L_b \cdot D$. This operation ensures that the increase in the estimated job length is at least proportional to the achieved performance increase.

Finally, if $N_{JL} < N_{JLBO}$, the current value of $L_b$ is too high and has to be reduced. The size of the reduction is chosen to be proportional to the decrease in the performance: $L_b = L_b - L_b \cdot ((N_{JLBO} - N_{JL}) / (N_{JLBO} \cdot 0.01))$.

Once optimal values of $L_b$ and $M_b$ are found for a particular application, they can be saved to be used for the following application runs.

6. Performance Evaluation of MeanFailureCP+

We evaluate the performance of MeanFailureCP+ in the simulated grid environment described in Section 4. The initial job length estimation $L_b$ is set to 1 hour, i.e. the average length of all submitted jobs, and the monitoring interval $M_b$ is successively initialized with 30 min, 1 hour and 2 hours. Fig. 7 and Fig. 8 show the simulation results for the two grid failure scenarios. For comparison, next to the number of jobs successfully processed by MeanFailureCP+, the figures depict the number of jobs processed by MeanFailureCP with the exact job lengths. The best result achieved in Section 4 (MFCP, avg+100%) is also presented in the figures. In the case of highly unstable grids, MeanFailureCP+ with the monitoring interval equal to the average job length leads to the best job throughput. However, MeanFailureCP+ with $M_b$ initialized with different values also performs better than MeanFailureCP. As can be expected, within a stable grid the benefit of MeanFailureCP+ is less significant.
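The job length feedback rule of Section 5, applied at the end of each monitoring interval, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function and argument names are hypothetical. Two readings are assumptions and are flagged in the comments: $D$ and $P$ are both treated as percentages, so that "$L_b \cdot X$" means X percent of $L_b$ (the only reading under which comparing $P$ with $D$ and scaling $L_b$ by either one stay consistent), and the evaluation phase is entered as soon as two monitoring intervals have elapsed. The guard against an empty previous interval is likewise an addition.

```python
def update_job_length(l_b, d, intervals_elapsed, jobs_last, jobs_prev, step=0.1):
    """Return the updated (L_b, D) pair at the end of a monitoring interval.

    l_b               -- current job length estimate (L_b), in seconds
    d                 -- current deviation value (D), read as a percentage
    intervals_elapsed -- number of monitoring intervals already elapsed (N_M)
    jobs_last         -- jobs finished in the last interval (N_JL)
    jobs_prev         -- jobs finished in the last but one interval (N_JLBO)
    """
    def percent_of_l_b(x):
        # Assumption: "L_b * X" in the text means X percent of L_b.
        return l_b * x / 100.0

    if intervals_elapsed < 2 or jobs_prev == 0:
        # Exploration phase: too little data to compare, nudge L_b up by D.
        # (The jobs_prev == 0 guard is an addition to avoid dividing by zero.)
        return l_b + percent_of_l_b(d), d

    if jobs_last < jobs_prev:
        # Throughput dropped: shrink L_b in proportion to the loss.
        loss = (jobs_prev - jobs_last) / (jobs_prev * 0.01)
        return l_b - percent_of_l_b(loss), d

    d += step
    if jobs_last > jobs_prev:
        # Throughput improved: grow L_b by the larger of P and D.
        gain = (jobs_last - jobs_prev) / (jobs_prev * 0.01)
        return l_b + percent_of_l_b(max(gain, d)), d

    # Throughput unchanged: keep exploring with the enlarged D.
    return l_b + percent_of_l_b(d), d
```

For example, starting from a one-hour estimate, a rise from 50 to 55 finished jobs is a 10% gain and moves the estimate to 3960 s, while a fall from 50 to 45 moves it to 3240 s.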

[Figure 7. Number of jobs successfully executed by MeanFailureCP+, with varying monitoring interval, in an unstable grid. Vertical axis: # jobs; horizontal axis: standard deviation; series: MFCP+(0.5h), MFCP+(1h), MFCP+(2h).]

[Figure 8. Number of jobs successfully executed by MeanFailureCP+, with varying monitoring interval, in a stable grid. Vertical axis: # jobs; horizontal axis: standard deviation; series: MFCP+(0.5h), MFCP+(1h), MFCP+(2h).]

7. Conclusion

Adaptive job checkpointing is a highly suitable technique to provide fault-tolerance in heterogeneous and decentrally managed grids. Its main advantage is that it allows for dynamic modification of checkpointing intervals as a function of application and system parameters collected at run-time. This paper introduces an adaptive checkpointing algorithm named MeanFailureCP+ that operates in absence of information on the total job duration. The algorithm initially requires only a rough estimation of the job length, which is modified at run-time, based on dynamically collected information on the algorithm's performance. We compare the performance of this new feedback-based approach against the performance of an adaptive checkpointing algorithm named MeanFailureCP. MeanFailureCP determines an appropriate checkpointing frequency based on job execution time and resource failure frequency. MeanFailureCP requires, however, that the exact job length is known before run-time. In this paper we show that knowledge of the exact job length is an overly strict requirement, which does not necessarily lead to optimal algorithm performance. Simulation results show that MeanFailureCP+, without a priori knowledge of the exact job length, increases grid throughput by up to 10%, compared to the throughput of MeanFailureCP initialized with exact values.

8. References

[1] Chtepen M, Claeys F.H.A, Dhoedt B, De Turck F, Demeester P, Vanrolleghem P.A. Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids. IEEE Transactions on Parallel and Distributed Systems 2009; 20(2): 180-190.
[2] Katsaros P, Angelis L, Lazos C. Performance and Effectiveness Trade-off for Checkpointing in Fault-Tolerant Distributed Systems. Concurrency and Computation: Practice & Experience 2007; 19(1): 37-63.
[3] Lan Z, Li Y. Adaptive Fault Management of Parallel Applications for High-Performance Computing. IEEE Transactions on Computers 2008; 57(12): 1647-1660.
[4] Oliner A, Rudolph L, Sahoo R. Cooperative Checkpointing: a Robust Approach to Large-Scale Systems Reliability. In: Proceedings of the 20th Annual International Conference on Supercomputing; 2006 June 28 - Jul 1; Cairns, Queensland, Australia.
[5] Oliner A, Sahoo R. Evaluating Cooperative Checkpointing for Supercomputing Systems. In: Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 06); 2006 Apr 25-29; Rhodes Island, Greece.
[6] Xiang Y, Li Z, Chen H. Optimizing Adaptive Checkpointing Schemes for Grid Workflow Systems. In: Proceedings of the 5th International Conference on Grid and Cooperative Computing (GCC 06); 2006 Oct 21-23; Changsha, Hunan, China.
[7] Ziv A, Bruck J. An On-Line Algorithm for Checkpoint Placement. IEEE Transactions on Computers 1997; 46(9): 976-985.