Local stageout update


Subir Sarkar, Frank Würthwein, Johannes Mülmenstädt
August 9, 2010

Big picture

Local stageout requires the following pieces to be viable end-to-end:
- CRAB support (see Subir, 7/26/2010)
- Proper permissions at sites (see Subir, 7/26/2010)
- Enough space and automated cleanup at sites
- An offline tool to do the local stageout recovery (this talk)

Why do we believe this will help?

J. Letts tested the entire matrix of T2-to-T2 connections for third-party transfers some time ago.
- He found that 10% of all connections failed on a given day.
- Of those 10%, again 10% failed when retried the next day.

We thus hypothesize that user-level retries of the stageout can bring the remote stageout error rate from 10% to 1% to 0.1% and so on, via a simple set of successive tries within the one week that sites are obliged to keep the local stageout files.

This talk describes the tool we want to give to users as part of the CRAB client deployment, so that they can do those retries as they see fit.
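The arithmetic behind this hypothesis can be sketched in a few lines. The sketch assumes, purely for illustration, that every retry fails independently with the same 10% probability seen in the transfer tests; the function name is ours and not part of any tool.

```python
def residual_failure_rate(per_try_failure: float, tries: int) -> float:
    """Fraction of files still not staged out after `tries` attempts.

    Assumes each attempt fails independently with the same probability,
    which is an idealization of the measured transfer-matrix behaviour.
    """
    return per_try_failure ** tries


for tries in (1, 2, 3):
    print(f"after {tries} tries: {residual_failure_rate(0.10, tries):.1%} still failing")
```

Under that assumption, three attempts within the one-week retention window would already reach the 0.1% level quoted above.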

The tool

The tool is a Python script that will be distributed in the bin/ area of CRAB, starting with 2.7.4.

Logic behind the script:
1. Parse the fjrs (framework job reports) in a CRAB project directory.
2. If the remote stageout failed but the local stageout succeeded (exit code 60308), figure out the PFN at the local site and the intended PFN at the remote site.
3. Attempt an lcg-cp from the local to the remote site.
4. If the copy succeeds, rewrite the fjr to indicate success and wrapper exit code 0 (keeping a backup of the fjr).
5. If any step fails, skip to the next fjr.

This program can be run iteratively, because on the next invocation it will only attempt to copy the files that still failed.

Parsing of fjrs, invocation of external commands, etc. are all wrapped in error-handling code, so that if something goes wrong the error is reported (and nothing bad is done to the fjr).
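The control flow of the five steps can be sketched as follows. This is not the script itself: the four callables are hypothetical stand-ins passed in by the caller (the slides do not show how the fjr is parsed or rewritten), and only the exit code 60308 and the skip-on-error policy come from the slide.

```python
import shutil
from pathlib import Path

LOCAL_STAGEOUT_ONLY = 60308  # exit code quoted on the slide


def retry_stageout(fjr_paths, read_exit_code, resolve_pfns, copy, mark_success,
                   backup_dir, report=print):
    """Process each fjr in turn; any failure is reported and that fjr is skipped.

    All callables are hypothetical stand-ins supplied by the caller:
      read_exit_code(path) -> int
      resolve_pfns(path)   -> (local_pfn, remote_pfn)
      copy(local, remote)  -> None, raising on failure
      mark_success(path)   -> None, rewrites the fjr with exit code 0
    Returns the list of fjrs whose files were copied on this pass.
    """
    copied = []
    for fjr in map(Path, fjr_paths):
        try:
            if read_exit_code(fjr) != LOCAL_STAGEOUT_ONLY:
                continue  # nothing to retry, or already fixed on an earlier pass
            local_pfn, remote_pfn = resolve_pfns(fjr)
            copy(local_pfn, remote_pfn)
            # Back up the original report before touching it.
            Path(backup_dir).mkdir(parents=True, exist_ok=True)
            shutil.copy2(fjr, Path(backup_dir) / fjr.name)
            mark_success(fjr)
            copied.append(fjr)
        except Exception as err:  # deliberately broad: report and move on
            report(f"error processing {fjr}: {err}")
    return copied
```

Because a successfully copied fjr no longer carries exit code 60308, running the function again only touches the files that still failed, which is what makes repeated invocation safe.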

Invocation

A basic usage message is printed if no arguments are given:

    [jmuelmen] ./retry_stageout.py
    usage: retry_stageout.py -c <crab directory> [--dry-run | -n] [--quiet | -q] [--verbose | -v | -vv | -vvv]

Supported arguments:

-c
    (Mandatory) CRAB project directory to parse
--dry-run, -n
    Do not copy anything, only print a list of local PFNs that need to be copied
--quiet, -q
    Print only error messages or the list of PFNs produced by -n
--verbose, -v, -vv, -vvv
    Be verbose. The first level of verbosity prints what the program is doing and whether external commands succeeded; the second level also prints the output of external commands; the third level runs the external commands in verbose mode, if available
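A command line with this shape can be sketched with Python's standard argparse module. The slides do not say how the script parses its arguments, so the use of argparse, the function name and the attribute names are assumptions; only the option letters and their meanings are taken from the slide.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options listed on the slide (a sketch)."""
    parser = argparse.ArgumentParser(
        description="Retry remote stageout for files that only reached local storage.")
    parser.add_argument("-c", dest="crab_dir", required=True,
                        metavar="<crab directory>",
                        help="CRAB project directory to parse (mandatory)")
    parser.add_argument("-n", "--dry-run", action="store_true",
                        help="do not copy anything, only list the local PFNs")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="print only errors (or the PFN list with -n)")
    # action="count" turns -v, -vv, -vvv into verbosity levels 1, 2, 3.
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="increase verbosity; may be repeated up to three times")
    return parser


args = build_parser().parse_args(["-vv", "-n", "-c", "my_crab_project"])
print(args.crab_dir, args.dry_run, args.quiet, args.verbose)
```

One difference from the behaviour described above: with `required=True`, argparse prints the usage line together with an error and exits with a non-zero status when `-c` is missing, rather than printing the usage message alone.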

Example: normal verbosity, with a single failed job

    [jmuelmen] ./retry_stageout.py -c ttw_madgraph_Spring10_START3X_V26_S09_v1
    retry_stageout.py: processing fjr ttw_madgraph_Spring10_START3X_V26_S09_v1/res/crab_fjr_6.xml
    retry_stageout.py: fjr ttw_madgraph_Spring10_START3X_V26_S09_v1/res/crab_fjr_6.xml indicates remote stageout failure with local copy
    retry_stageout.py: copying from local: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/crab_testing_2/ntuple_6_1.root to remote: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/store/user/jmuelmen/crab_testing_2/ntuple_6_1.root
    retry_stageout.py: rewriting fjr to indicate remote stageout success
    retry_stageout.py: backup path is ttw_madgraph_Spring10_START3X_V26_S09_v1/res/retry_backup
    retry_stageout.py: old fjr will be backed up to ttw_madgraph_Spring10_START3X_V26_S09_v1/res/retry_backup/crab_fjr_6.xml
    retry_stageout.py: all fjrs processed, exiting

(The quiet version of that would have been no output at all, unless there had been an error.)

Example: a quiet dry run

    [jmuelmen] ./retry_stageout.py -n -q -c ttw_madgraph_Spring10_START3X_V26_S09_v1
    retry_stageout.py: files that need to be copied:
    srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/crab_testing_2/ntuple_6_1.root

Since no error occurred, the only output is the list of PFNs that need to be copied.

Example: extreme verbosity

And we mean extreme... (note that -vvv also causes lcg-cp to become verbose, for example). Long lines are cut off at the right edge of the slide.

    [jmuelmen] ./retry_stageout.py -vvv -c ttw_madgraph_Spring10_START3X_V26_S09_v1
    retry_stageout.py: executing command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/nodes
    retry_stageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/nodes exit stat
    retry_stageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/nodes output: <
    retry_stageout.py: executing command: /bin/ls ttw_madgraph_Spring10_START3X_V26_S09_v1/res/*.xml
    retry_stageout.py: command: /bin/ls ttw_madgraph_Spring10_START3X_V26_S09_v1/res/*.xml exit status: 0
    retry_stageout.py: command: /bin/ls ttw_madgraph_Spring10_START3X_V26_S09_v1/res/*.xml output: ttw_m
    retry_stageout.py: processing fjr ttw_madgraph_Spring10_START3X_V26_S09_v1/res/crab_fjr_6.xml
    retry_stageout.py: fjr ttw_madgraph_Spring10_START3X_V26_S09_v1/res/crab_fjr_6.xml indicates remote stageo
    retry_stageout.py: local stageout nodename = T2_US_UCSD
    retry_stageout.py: executing command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/lfn2pfn?n
    retry_stageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/lfn2pfn?node=t
    retry_stageout.py: command: wget -O - -q http://cmsweb.cern.ch/phedex/datasvc/xml/prod/lfn2pfn?node=t
    retry_stageout.py: executing command: grep export endpoint= ttw_madgraph_Spring10_START3X_V26_S09_v1/job
    retry_stageout.py: command: grep export endpoint= ttw_madgraph_Spring10_START3X_V26_S09_v1/job/CMSS
    retry_stageout.py: command: grep export endpoint= ttw_madgraph_Spring10_START3X_V26_S09_v1/job/CMSS
    retry_stageout.py: copying from local: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/
    retry_stageout.py: executing command: lcg-cp -v -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/
    retry_stageout.py: command: lcg-cp -v -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoo
    retry_stageout.py: command: lcg-cp -v -D srmv2 srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoo
    Using grid catalog: prod-lfc-shared-central.cern.ch
    VO name: cms
    Checksum type: None
    Source SE type: SRMv2
    Source SRM Request Token: get:981202
    Destination SE type: SRMv2
    Destination SRM Request Token: put:981203
    Source URL: srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/cr
    Trying SURL srm://bsrm-1.t2.ucsd.edu:8443/srm/v2/server?sfn=/hadoop/cms/phedex/store/temp/user/jmuelmen/cr

Tests

Various failure modes were tested on small sets of files:
- missing permissions
- no proxy
- corrupted fjrs
- ... and the like

Every call to an external program is wrapped in error-checking code.
Error handling is simple: print the error message and skip to the next file.

In addition, the tool was stress-tested over the weekend:
- 120 ROOT files of 180 MB each, from MIT and Nebraska to Pisa
- 20 ROOT files of 1.8 GB each, from UCSD to Pisa

The tool passed both tests.
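The "wrap every external call in error-checking code" pattern might look like the sketch below. It uses the standard subprocess module; the names `run_checked` and `CommandError` and the exact message wording are ours, and the verbosity levels only loosely follow the ones described for the script.

```python
import subprocess


class CommandError(RuntimeError):
    """Raised when an external command cannot be started or exits non-zero."""


def run_checked(argv, verbosity=0, report=print):
    """Run an external command and return its stdout, or raise CommandError.

    verbosity 1 reports the command and its exit status,
    verbosity 2 additionally reports the captured output.
    """
    if verbosity >= 1:
        report("executing command: " + " ".join(argv))
    try:
        result = subprocess.run(argv, capture_output=True, text=True)
    except OSError as err:  # e.g. the executable does not exist
        raise CommandError(f"could not start {argv[0]}: {err}") from err
    if verbosity >= 1:
        report(f"exit status: {result.returncode}")
    if verbosity >= 2:
        report(f"output: {result.stdout}")
    if result.returncode != 0:
        raise CommandError(
            f"{argv[0]} exited with status {result.returncode}: {result.stderr.strip()}")
    return result.stdout
```

A caller that loops over files can then catch `CommandError`, print the message and continue with the next file, which matches the simple error handling described above.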

Conclusion

- We have developed a tool which can be used in conjunction with local stageout.
- The tool copies files from the local to the remote SE if the fjrs indicate that remote stageout failed.
- The tool is robust against failure at any of the various steps of the procedure.
- It can be used incrementally to retry the copying of files that did not succeed on the previous pass.
- It is ready for inclusion in CRAB 2.7.4.