2009 International Conference on Computer Engineering and Applications
IPCSIT vol. 2 (2011) © (2011) IACSIT Press, Singapore

Study on the Lightweight checkpoint based rollback recovery mechanism

Zhang i,3, ang Rui Dai Hao 3, Ma Mingai 3 and i Xianghong 4

1 Institute of Command Automation, PLA University of Science and Technology
2 Train base of General Staff am 5
3 Institute of China Electronic Equipment System Engineering Company
4 Bureau of Xinhua News Agency Communication Technology

Abstract. The cluster service recovery mechanism is very important in the study of survivable server clusters. After analyzing the related work, this paper points out the deficiencies of the existing recovery mechanisms and proposes a lightweight checkpoint based rollback recovery mechanism that aims to enhance recovery performance. The experimental results validate that its performance exceeds that of the existing methods.

Keywords: Rollback recovery, Cluster, Service survivability.

1. Introduction

With the development of network technology and the increase of users' requirements, the continuity of the network service provided by the cluster must be enhanced. Some applications, such as electronic business, require very high quality of service, so a survivable server cluster system must have not only the ability of resistance but also the ability of recovery when necessary. The key application data should remain intact and the recovery mechanism should be transparent to the users, so that the application program is affected very little.

Checkpoint based rollback recovery is a common fault-tolerant technology in software/hardware systems. Rollback recovery [1] means that the system or the process starts to run anew from a saved status that was recorded before the fault time. In server cluster systems, the checkpoint information is necessary to recover the faulty service. The reasons that cause a service interruption may differ between the study of fault tolerance and the study of survivability, but the service survivability study does not focus on the causes, so checkpoint based rollback recovery can also be used in the service survivability study. Although the rollback recovery technology provides reliability, it increases the complexity of the system.

A checkpoint [2] is the copy of the application program status, which is saved in stable storage; we assume that the stable storage never fails.
The checkpoint information can also be sent to the storage of another cluster node in order to recover the service on that node; this method is commonly used in cluster systems and decreases the recovery time.

2. Checkpoint Mechanism

2.1. Checkpoint Information

Checkpoints can be set flexibly: some applications can recover from very little information, and others can recover from special checkpoint information such as a special time, and that information may be very small. To decrease the overhead of the program, we can decrease the number of checkpoints or the size of the checkpoint files.

Corresponding author. Tel.: 8666885; fax: 8666858. E-mail address: zhangli_ism@63.com.

In ordinary application programs, checkpoints need to save the following contents [3]:
(1) process data field, content in the user stack;
(2) context related items, including program counter, process or status register, status point, etc.;
(3) active file information, including file descriptor, access mode, file size, read/write pointer, etc.;
(4) related signal information, including shield code, stack pointer, process function handler, and suspended signal flag;
(5) user file content, register content.

This process status content is sufficient for a single-point server, but in server cluster systems the checkpoint information also needs to include:
(6) cluster view of the server node;
(7) backup and interplay information among cluster nodes;
(8) the time and content when receiving/sending messages;
(9) special information related to the application.

We improve the traditional checkpoint methods by recording as little information as possible when setting a checkpoint, while this information can still be used to recover the process. For example, for a send and receive message pair we record only one of them; when the program runs correctly, the other message is also correct. We save the checkpoint in data fields, as in Table 1:

Table 1: lightweight checkpoint information data field

Exec_sta    Pro_ctl    Mem_add    Msg_sta    File_sta    Cluster_view    Reserved

There are seven fields in the lightweight checkpoint. Their contents are as follows:
Exec_sta: process running status, including context transition information, register and stack pointer information;
Pro_ctl: process control status, process id, process relationship;
Mem_add: memory and address status, including content in the stack and data fields related to the process;
Msg_sta: record of sent messages and messages in cache;
File_sta: the file state, including the list of files the process opened, handler and cache information;
Cluster_view: the status of the cluster, including the node role, backup and interplay information among cluster nodes;
Reserved: fields reserved for information related to special applications.

These are the whole content of the lightweight checkpoint. The first three items alone can sometimes recover a process, which greatly reduces the burden that checkpointing brings.
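The seven-field record described above can be sketched as a small data structure. This is a minimal illustration only: the Python class, its attribute names, the use of JSON as the storage format, and the `core_fields` helper are assumptions made for this sketch and are not taken from the paper's implementation.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LightweightCheckpoint:
    """Hypothetical container mirroring the seven fields of Table 1."""
    exec_state: dict = field(default_factory=dict)    # running status: context, registers, stack pointer
    proc_control: dict = field(default_factory=dict)  # process id and process relationships
    mem_addr: dict = field(default_factory=dict)      # memory and address status
    msg_state: list = field(default_factory=list)     # sent messages and cached messages
    file_state: list = field(default_factory=list)    # open files, handlers, cache information
    cluster_view: dict = field(default_factory=dict)  # node role, backup and interplay information
    reserved: dict = field(default_factory=dict)      # application-specific extras

    def to_json(self) -> str:
        # Serialize the whole record so it can be written to stable storage.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "LightweightCheckpoint":
        # Rebuild the record from its serialized form.
        return cls(**json.loads(text))

    def core_fields(self) -> dict:
        # Only the first three fields, which the text says can sometimes suffice.
        data = asdict(self)
        return {name: data[name] for name in ("exec_state", "proc_control", "mem_addr")}
```

Saving only `core_fields()` when the application allows it is one way to read the "lightweight" idea: the smaller the record, the cheaper each checkpoint is to write and to read back.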
2.2. Set checkpoint

In the early days, the running process had to be suspended while a checkpoint was set [4], and the process continued to run only after the checkpoint had finished. This method guarantees that the checkpoint information is coherent across the whole cluster, but it disturbs normal process execution and its overhead is large, so it is not fit for applications that have many requests and large data.

To prevent the checkpoint mechanism from influencing normal applications, some researchers create a new thread dedicated to setting checkpoints. This thread can communicate with the main process, record the status of every thread at checkpoint time, and save the messages, keeping the checkpoint file as small as possible. Meanwhile, the other threads continue running their usual work.

Based on BLCR [5] (Berkeley Lab's Linux Checkpoint/Restart), we improve the setting policy of the checkpoint, add the lightweight mechanism, and reduce the overhead of the checkpoint. To guarantee the reliability of the recovery method, the storage must be absolutely reliable.

There are two phases in setting the lightweight checkpoint. In the first phase, the application initializes with a thread callback function, which generates a callback thread. When a checkpoint is to be set, this thread stops blocking, while the other threads in the process continue to run their own program; the file handler is used to save the checkpoint information. When the last thread calls back to the cr_checkpoint status, the callback function returns to the kernel status, and the process enters the second phase.
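The dedicated-thread idea described above can be illustrated in user space. The sketch below is only an analogy: it does not use BLCR or any kernel facility, and the `Worker` class, the `checkpoint_thread` function, and the event-based trigger are hypothetical names chosen for this example.

```python
import threading


class Worker:
    """Hypothetical worker; its progress counter stands in for thread status."""

    def __init__(self, name):
        self.name = name
        self.progress = 0
        self.lock = threading.Lock()

    def step(self):
        # One unit of normal application work.
        with self.lock:
            self.progress += 1

    def snapshot(self):
        # Copy the status under the lock so the record is consistent.
        with self.lock:
            return {"name": self.name, "progress": self.progress}


def checkpoint_thread(workers, trigger, stop, store):
    """Block until a checkpoint is requested, then record every worker's status."""
    while not stop.is_set():
        if trigger.wait(timeout=0.05):
            trigger.clear()
            store.append([w.snapshot() for w in workers])
```

The workers are never suspended as a group; each one is paused only for the short time its own lock is held, which is the property the dedicated-thread approach is after.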
In the second phase, system calls send the checkpoint signal to the other threads; after all the threads have entered the kernel status, the checkpoint information begins to be recorded. After the threads write their own status into the checkpoint file, the system returns to user space and continues executing the other program code.

2.3. Rollback recovery

To recover the application process from the checkpoint information, we mainly rely on the ioctl call to read the checkpoint file data. Rollback recovery means re-executing the process from the last checkpoint time when the system fails. Under a Linux system, we first call the function do_fork to generate the process, then use clone to generate all the threads that the process needs, and start up the callback process to run these threads. The first thread reads the checkpoint file and recovers the shared items and its own id, register and signal status; the other threads gradually recover their saved status; at last, the relationships among these threads are recovered. Then the process returns from the kernel status, and after all the processes have recovered, the other application code begins to execute.

3. Optimal Checkpoint Intervals

When there are few failures and the running environment is good, setting too many checkpoints will degrade normal application performance. On the other hand, if the load is high and the probability of failure is high, checkpoint intervals that are too long bring large recovery overhead, because the process must re-execute from a point far from the failure time. We decide the optimal checkpoint intervals in this section.

3.1. Time spending of setting checkpoint

We assume that the server node failures in the cluster follow a Poisson distribution with a given failure rate. Three time quantities describe checkpointing: the time spending of a single checkpoint under no-failure conditions, the checkpoint duration, and the time spending of recovery. Two of these quantities are constant for a specific application. Within the period of a checkpoint the program executing time increases, and the added time spending is the difference of two of these quantities.

Assume a given application executing time and, under the checkpoint mechanism, an average real executing time of the program. The task is divided by several checkpoints, whose number is denoted by K; each has an average time spending, so the total is K times this average. The service duration C can then be calculated as a ratio of these times.

3.2. Status transformation

There are two kinds of status in the cluster nodes: one is the normal status, in which the node runs the process (including setting checkpoints); the other is the failure status, in which the process rolls back to recover.
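The two-status view can be explored with a small simulation. This is a minimal sketch under stated assumptions, not the paper's model: the function names and parameters are hypothetical, failures arrive as a Poisson process, a failure discards only the work done since the last checkpoint, and failures during recovery are ignored.

```python
import math
import random


def run_task(work, interval, ckpt_cost, recovery_cost, failure_rate, rng):
    """Return the total elapsed time needed to finish `work` units of computation."""
    elapsed = 0.0
    done = 0.0  # work already protected by a checkpoint
    while done < work:
        chunk = min(interval, work - done)
        segment = chunk + ckpt_cost
        # Poisson failures: the time to the next failure is exponential (memoryless).
        ttf = rng.expovariate(failure_rate) if failure_rate > 0 else math.inf
        if ttf >= segment:
            elapsed += segment              # normal status: work, then checkpoint
            done += chunk
        else:
            elapsed += ttf + recovery_cost  # failure status: roll back and recover
    return elapsed


def mean_time(interval, runs=2000, seed=0, **params):
    """Average `run_task` over several runs for one checkpoint interval."""
    rng = random.Random(seed)
    total = sum(run_task(interval=interval, rng=rng, **params) for _ in range(runs))
    return total / runs
```

Calling `mean_time` for several interval values with the same `work`, `ckpt_cost`, `recovery_cost` and `failure_rate` gives a rough empirical picture of the trade-off between checkpoint overhead and re-execution cost that the analysis in this section treats in closed form.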
Figure 1 shows the status transformation.

Figure 1. Status transformation

In Figure 1, one status is the normal status and the other is the failure status; the letters beside the arrows are the transition probabilities. The two-status transition model reduces the complexity of the system status, and it is easy to program.

3.3. Optimal checkpoint intervals

The expected checkpoint interval can be solved using the two-state discrete Markov chain presented in Figure 1. For a Poisson process with transition rate λ, the expected number of events in time t is λt; if x denotes the number of events, then
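$$p(x) = \frac{(\lambda t)^{x} e^{-\lambda t}}{x!}, \quad x = 0, 1, 2, \ldots \tag{1}$$

Each transition (x, y), from state X to state Y in the Markov chain, has an associated transition probability and a cost, both indexed by xy. These variables can be calculated one by one as follows.

[Equations (2)-(4)]

According to the transition probability and the conditional probability, we can get the time cost as follows:

[Equations (5)-(9)]

From the failure state to the normal state, the process finishes the rollback recovery, and the average time spending is expressed in terms of the average recovery time, so we get formulas (5)-(9).

Now we have the average time spending of every state transition. Let Ψ denote the average running time of the application process; according to the time spendings and their probabilities, we get Ψ as follows:

[Equation (10)]

We introduce the checkpoint time efficiency, which is defined from Ψ:

[Equation (11)]

Given the average checkpoint interval, the proper interval should satisfy the following formulas (12) and (13):

[Equations (12)-(13)]

We use the ezplot function in a Matlab program to plot the relationship between the checkpoint interval and y, and we get the result shown in Figure 2.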
[Figure 2: panels (a) and (b), ezplot curves of y against the checkpoint interval]

Figure 2. The optimal checkpoint intervals

Figure 2(a) plots the differential coefficient function y for three values of one model parameter (3, 6 and 8 respectively). The proper checkpoint interval is the point where y is zero; we conclude that as this parameter increases, the proper checkpoint interval increases. From Figure 2(b), we conclude that as the parameter varied there increases, the proper checkpoint interval decreases, so the proper interval should be decided according to the specific conditions.

4. Performance study

In order to evaluate the performance of the lightweight checkpoint mechanism, we utilize the Charm language to edit the parallel system and implement the checkpoint based rollback recovery method. We choose net-Linux, and the compiler is GNU. This stage can call the initial program of the operating system; the method of this paper is embedded in the application program to evaluate its performance.

[Figure 3: panels (a) and (b)]

Figure 3. Recovery time performance comparison

Figure 3(a) compares the recovery method of this paper with the traditional log based method: under several system failures, the average running time of the lightweight checkpoint based rollback recovery is smaller than that of the log based method. With the remaining parameter held constant, the recovery method in this paper is more efficient than the pessimistic log based recovery method: not only is the average running time smaller, but CPU resources are also used more economically.

5. Conclusion

We improved the traditional checkpoint based recovery method and simplified the checkpoint file, which makes the checkpoint setting process and the recovery easy and efficient to perform. The recovery process can choose among the lightweight checkpoint information field data, which can greatly reduce the recovery time. The experimental results showed that this method consumes fewer resources and is more efficient.

6. References

[1] E. N. Elnozahy, L. Alvisi, Y. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34(3): 375-408, 2002.
[2] Vaidya, N.H., Impact of checkpoint latency on overhead ratio of a checkpointing scheme, IEEE Transactions on Computers, Volume 46, Issue 8, Aug. 1997: 942-947.
[3] Logging Service, http://docs.sun.com/source/86-6687- /logging.html. 8--3.
[4] Tannenbaum T, Litzkow M. The Condor Distributed Processing System [J]. Dr. Dobb's Journal, 1995: 40-48.
[5] J. Duell, P. Hargrove, E. Roman, The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart, --3.