Local stageout update
Subir Sarkar, Frank Würthwein, Johannes Mülmenstädt
August 9, 2010
Big picture
Local stageout requires the following pieces to be viable end-to-end:
- CRAB support (see Subir 7/26/2010)
- Proper permissions at sites (see Subir 7/26/2010)
- Enough space and automated cleanup at sites
- An offline tool to do the local stageout recovery (this talk)
Why do we believe this will help?
J. Letts tested the entire matrix of T2-to-T2 connections for third-party transfers some time ago. He found that 10% of all connections failed on a given day, and that of those 10%, again 10% failed when retried the next day.
We thus hypothesize that user-level retries of the stageout can bring the remote stageout error rate from 10% to 1% to 0.1% ... through a simple set of successive tries within the one week that sites are obliged to keep the local stageout files.
This talk describes the tool we want to give users as part of the CRAB client deployment, so that they can do those retries as they see fit.
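The hypothesis amounts to a geometric decay of the residual failure rate. The following is a minimal sketch of that arithmetic, assuming every attempt fails independently with the same 10% probability; the measurement quoted above only covers the first retry, so the third figure is an extrapolation. The function name is illustrative.

```python
def residual_failure_rate(per_try_failure, tries):
    """Fraction of files still not staged out after `tries` attempts,
    assuming each attempt fails independently with the same probability."""
    return per_try_failure ** tries


# Original remote stageout (1 try) followed by up to two user-level retries.
for tries in (1, 2, 3):
    print(f"after {tries} tries: {residual_failure_rate(0.10, tries):.1%}")
```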
The tool
The tool is a Python script that will be distributed in the bin/ area of CRAB, starting with 2.7.4.
Logic behind the script:
1. Parse the fjrs (framework job reports) in a CRAB project directory
2. If the remote stageout failed but the local stageout succeeded (exit code 60308), figure out the PFN at the local site and the intended PFN at the remote site
3. Attempt an lcg-cp from the local to the remote site
4. If the copy succeeds, rewrite the fjr to indicate success and wrapper exit code 0 (keeping a backup of the fjr)
5. If any step fails, skip to the next fjr
The program can be run iteratively, because on the next invocation it will only attempt to copy the files that failed.
Parsing of fjrs, invocation of external commands, etc. are all wrapped in error-handling code, so that if something goes wrong, the error is reported (and nothing bad is done to the fjr).
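The five steps can be sketched roughly as follows. This is not the script itself but an illustration under stated assumptions: that the wrapper exit code is stored in a `FrameworkError` element with `Type="WrapperExitCode"`, that the fjrs live in `res/crab_fjr_*.xml`, and that a caller-supplied `resolve_pfns` callback (hypothetical) returns the local and remote PFNs. The backup directory name `fjr_backup` is likewise made up for the example.

```python
import glob
import os
import shutil
import subprocess
import xml.etree.ElementTree as ET

LOCAL_ONLY = "60308"  # remote stageout failed, local stageout succeeded


def wrapper_status_nodes(tree):
    # ASSUMPTION: the wrapper exit code is stored as
    # <FrameworkError Type="WrapperExitCode" ExitStatus="..."/>.
    return [e for e in tree.getroot().iter("FrameworkError")
            if e.get("Type") == "WrapperExitCode"]


def needs_retry(fjr_path):
    # Steps 1-2: parse the fjr and look for the local-stageout-only code.
    nodes = wrapper_status_nodes(ET.parse(fjr_path))
    return any(n.get("ExitStatus") == LOCAL_ONLY for n in nodes)


def mark_success(fjr_path, backup_dir):
    # Step 4: back up the fjr, then set the wrapper exit code to 0.
    os.makedirs(backup_dir, exist_ok=True)
    shutil.copy2(fjr_path, backup_dir)
    tree = ET.parse(fjr_path)
    for node in wrapper_status_nodes(tree):
        node.set("ExitStatus", "0")
    tree.write(fjr_path)


def lcg_cp(src, dst):
    # Step 3: True if lcg-cp exits with status 0.
    return subprocess.call(["lcg-cp", "-D", "srmv2", src, dst]) == 0


def retry_stageout(crab_dir, resolve_pfns, copy=lcg_cp):
    """Return the fjrs whose files were copied on this pass.

    resolve_pfns(fjr_path) -> (local_pfn, remote_pfn) is a hypothetical
    callback standing in for the PFN lookup.
    """
    res = os.path.join(crab_dir, "res")
    recovered = []
    for fjr in sorted(glob.glob(os.path.join(res, "crab_fjr_*.xml"))):
        try:
            if not needs_retry(fjr):
                continue
            local_pfn, remote_pfn = resolve_pfns(fjr)
            if copy(local_pfn, remote_pfn):
                mark_success(fjr, os.path.join(res, "fjr_backup"))
                recovered.append(fjr)
        except Exception as exc:
            # Step 5: report the problem and move on; the fjr is untouched.
            print(f"error while processing {fjr}: {exc}; skipping")
    return recovered
```

Because a recovered fjr no longer carries exit code 60308, a second call only revisits files whose copy failed, which is what makes repeated runs safe.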
Invocation
Basic usage message is printed if no arguments are given:

    [jmuelmen] ./retrystageout.py
    usage: retrystageout.py -c <crab directory> [--dry-run -n] [--quiet -q] [--verbose -v -vv -vvv]

Supported arguments:

    -c                         (Mandatory) CRAB project directory to parse
    --dry-run, -n              Do not copy anything, only print a list of local PFNs that need to be copied
    --quiet, -q                Print only error messages or the list of PFNs produced by -n
    --verbose, -v, -vv, -vvv   Be verbose. The first level of verbosity prints what the program is doing and whether external commands succeeded; the second level also prints the output of external commands; the third level runs the external commands in verbose mode, if available
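A command line with this shape could be declared along the following lines. This is only a sketch using `argparse` from the Python standard library; how the script actually parses its arguments is not stated here, and the `dest` names are illustrative. One difference from the behaviour described above: with no arguments, `argparse` prints the usage line together with an error about the missing `-c`.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Retry remote stageout for files that were staged out locally.")
    parser.add_argument("-c", dest="crab_dir", required=True,
                        metavar="<crab directory>",
                        help="CRAB project directory to parse")
    parser.add_argument("-n", "--dry-run", action="store_true",
                        help="only list the local PFNs that need to be copied")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="print only errors (or the PFN list with -n)")
    # action="count" turns -v, -vv, -vvv into verbosity levels 1, 2, 3.
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="increase verbosity (may be repeated)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.crab_dir, args.dry_run, args.quiet, args.verbose)
```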
Example: normal verbosity, with a single failed job

    [jmuelmen] ./retrystageout.py -c ttw_madgraph_Spring10-START3X_V26_S09-v1
    retrystageout.py: processing fjr ttw_madgraph_Spring10-START3X_V26_S09-v1/res/crab_fjr_6.xml
    retrystageout.py: fjr ttw_madgraph_Spring10-START3X_V26_S09-v1/res/crab_fjr_6.xml indicates remote stageout failure with local copy
    retrystageout.py: copying from local: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/crabtesting2/ntuple_6_1.root to remote: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/store/user/jmuelmen/crabtesting2/ntuple_6_1.root
    retrystageout.py: rewriting fjr to indicate remote stageout success
    retrystageout.py: backup path is ttw_madgraph_Spring10-START3X_V26_S09-v1/res/retrybackup
    retrystageout.py: old fjr will be backed up to ttw_madgraph_Spring10-START3X_V26_S09-v1/res/retrybackup/crab_fjr_6.xml
    retrystageout.py: all fjrs processed, exiting

(The quiet version of that would have produced no output at all, unless there had been an error.)
Example: a quiet dry run

    [jmuelmen] ./retrystageout.py -n -q -c ttw_madgraph_Spring10-START3X_V26_S09-v1
    retrystageout.py: files that need to be copied:
    srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/crabtesting2/ntuple_6_1.root

Since no error occurred, the only output is the list of PFNs that need to be copied.
Example: extreme verbosity
And we mean extreme... (note that -vvv also causes lcg-cp to become verbose, for example). Long lines are cut off at the slide edge.

    [jmuelmen] ./retrystageout.py -vvv -c ttw_madgraph_Spring10-START3X_V26_S09-v1
    retrystageout.py: executing command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/nodes
    retrystageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/nodes exit stat
    retrystageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/nodes output: <
    retrystageout.py: executing command: /bin/ls ttw_madgraph_Spring10-START3X_V26_S09-v1/res/*.xml
    retrystageout.py: command: /bin/ls ttw_madgraph_Spring10-START3X_V26_S09-v1/res/*.xml exit status: 0
    retrystageout.py: command: /bin/ls ttw_madgraph_Spring10-START3X_V26_S09-v1/res/*.xml output: ttw_m
    retrystageout.py: processing fjr ttw_madgraph_Spring10-START3X_V26_S09-v1/res/crab_fjr_6.xml
    retrystageout.py: fjr ttw_madgraph_Spring10-START3X_V26_S09-v1/res/crab_fjr_6.xml indicates remote stageo
    retrystageout.py: local stageout nodename = T2_US_UCSD
    retrystageout.py: executing command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/lfn2pfn?n
    retrystageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/lfn2pfn?node=t
    retrystageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/lfn2pfn?node=t
    retrystageout.py: executing command: grep export endpoint= ttw_madgraph_Spring10-START3X_V26_S09-v1/job
    retrystageout.py: command: grep export endpoint= ttw_madgraph_Spring10-START3X_V26_S09-v1/job/CMSS
    retrystageout.py: command: grep export endpoint= ttw_madgraph_Spring10-START3X_V26_S09-v1/job/CMSS
    retrystageout.py: copying from local: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/
    retrystageout.py: executing command: lcg-cp -v -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/
    retrystageout.py: command: lcg-cp -v -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoo
    retrystageout.py: command: lcg-cp -v -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoo
    Using grid catalog : prod-lfc-shared-central.cern.ch
    VO name: cms
    Checksum type: None
    Trying SURL srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/cr
    Source SE type: SRMv2
    Source SRM Request Token: get:981202
    Destination SE type: SRMv2
    Destination SRM Request Token: put:981203
    Source URL: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/cr
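The transcript shows the PFN being looked up through the PhEDEx data service `lfn2pfn` call, but the query string is cut off after `node=`. The sketch below shows how such a query could be built and its answer read. It assumes the query parameters `node`, `lfn` and `protocol`, and a response that carries the result in a `pfn` attribute of a `mapping` element. Neither is visible in the transcript, so both should be checked against the data service documentation. The function names and the sample LFN are illustrative, and no network access is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL as it appears in the transcript above.
DATASVC = "http://cmsweb.cern.ch/phedex/datasvc/xml/prod"


def lfn2pfn_url(node, lfn, protocol="srmv2"):
    # ASSUMPTION: parameter names node/lfn/protocol (the transcript is truncated).
    query = urlencode({"node": node, "lfn": lfn, "protocol": protocol})
    return f"{DATASVC}/lfn2pfn?{query}"


def pfn_from_response(xml_text):
    # ASSUMPTION: the answer looks like <phedex><mapping pfn="..."/></phedex>.
    mapping = ET.fromstring(xml_text).find("mapping")
    return None if mapping is None else mapping.get("pfn")


if __name__ == "__main__":
    print(lfn2pfn_url("T2_US_UCSD", "/store/user/someuser/ntuple_6_1.root"))
```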
Tests
Various failure modes were tested on small sets of files:
- missing permissions
- no proxy
- corrupted fjrs
- ... and the like
Every call to an external program is wrapped in error-checking code.
Error handling is simple: print the error message and skip to the next file.
In addition, the tool passed some stress tests over the weekend:
- 120 ROOT files of 180 MB each, from MIT and Nebraska to Pisa
- 20 ROOT files of 1.8 GB each, from UCSD to Pisa
The tool passed both tests.
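The "wrap every external call" policy might look like the helper below. It is a sketch, not the tool's code: the name `run_command` and the message formats are made up, and the verbosity thresholds simply mirror the levels described for -v and -vv. The helper never raises for a missing executable or a non-zero exit status, so a caller can print the error and skip to the next file.

```python
import subprocess
import sys


def run_command(argv, verbosity=0):
    """Run an external command and return (ok, output).

    A missing executable or a non-zero exit status is reported on stderr
    and returned as ok=False instead of raising.
    """
    if verbosity >= 1:
        print("executing command:", " ".join(argv))
    try:
        proc = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, text=True)
    except OSError as exc:
        print(f"error: could not run {argv[0]}: {exc}", file=sys.stderr)
        return False, ""
    if verbosity >= 1:
        print("exit status:", proc.returncode)
    if verbosity >= 2:
        print("output:", proc.stdout)
    if proc.returncode != 0:
        print(f"error: {argv[0]} exited with status {proc.returncode}",
              file=sys.stderr)
    return proc.returncode == 0, proc.stdout
```

A caller would then write something like `ok, out = run_command(["lcg-cp", src, dst])` and continue with the next fjr when `ok` is false.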
Conclusion
- We have developed a tool that can be used in conjunction with local stageout
- The tool copies files from the local to the remote SE if the fjrs indicate that remote stageout failed
- The tool is robust against failure at any of the various steps of the procedure
- It can be used incrementally to retry the copying of files that did not succeed on the previous pass
- It is ready for inclusion in CRAB 2.7.4