Parse cmsRun stdout/err to detect corrupted root files #7548
Comments
another example (this job exited with 8021 btw) Not clear to me if msg is from xroot or cmssw and if we manage
|
Since the cmsRun stdout is brought back to the scheduler even when jobs are killed on the WN (thanks to the latest changes by Daro in #7529), I think this is better addressed inside the PostJob. That makes debugging and updating much easier. |
a more classic example of corrupted file (job completed with 8028)
|
so far I have 3 examples of bad file all have So I will start looking for that. this seems to work, as user
I can also cleanup
|
this will be cleaner
i.e. I will put a one-line file in
files there will be automatically cleaned by EOS after 30 days. |
more examples in https://cms-talk.web.cern.ch/t/corrupted-missing-parkingbph-miniaod-files/29281/1 they all fit the pattern
|
initial implementation in #7867 this PR includes code to write a JSON file in
for each of those files one can obtain the PFN to be flagged (cfr https://rucio.cern.ch/documentation/client_api/replicaclient#declare_suspicious_file_replicas ) from CLI |
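A minimal sketch of the flagging step, assuming the PFNs have already been obtained. It relies on the `declare_suspicious_file_replicas(pfns, reason)` method of the Rucio `ReplicaClient` documented at the link above. The helper name `declare_suspicious`, the `dry_run` switch, and passing the client in as an argument are illustrative choices, not part of #7867.

```python
def declare_suspicious(client, pfns, reason, dry_run=True):
    """Declare a de-duplicated list of PFNs as suspicious replicas.

    `client` is expected to behave like rucio's ReplicaClient, i.e. to offer
    declare_suspicious_file_replicas(pfns, reason). It is passed in so that
    the helper can be exercised without a Rucio installation.
    Returns the list of PFNs that were (or, in a dry run, would be) declared.
    """
    unique_pfns = sorted(set(pfns))
    if not unique_pfns:
        return []
    if not dry_run:
        # Hypothetical wiring: the real call needs a configured Rucio client.
        client.declare_suspicious_file_replicas(unique_pfns, reason)
    return unique_pfns
```

With a real client one would presumably build it via `from rucio.client.replicaclient import ReplicaClient` and pass `ReplicaClient()` as `client`. Keeping `dry_run=True` as the default makes it possible to review the list before anything is declared.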
my code currently fails on this
since the line after So I guess I will start by skipping errors where the line after
Maybe we should build some kind of updatable heuristic based on regexps? Sounds a bit far-fetched! |
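One way such an updatable regexp heuristic could look, purely as a sketch: the rules live in a JSON list that can be edited without touching the code. The only rule shown uses the "Fatal Root Error" string from this issue. The rule format and the function names `load_rules` and `match_rules` are assumptions made for illustration.

```python
import json
import re

# Single rule built from the marker string quoted in this issue. Further
# rules would be added by editing this JSON text (or a file holding it).
DEFAULT_RULES_JSON = '[{"name": "fatal_root_error", "pattern": "Fatal Root Error"}]'


def load_rules(rules_json=DEFAULT_RULES_JSON):
    """Compile a JSON list of {"name", "pattern"} objects into (name, regex) pairs."""
    return [(rule["name"], re.compile(rule["pattern"]))
            for rule in json.loads(rules_json)]


def match_rules(log_text, rules):
    """Return the names of all rules whose pattern occurs in the log text."""
    return [name for name, regex in rules if regex.search(log_text)]
```

Rules that should be skipped (e.g. errors that are not about corrupted files) could be handled the same way with a second list, but that is left out here since the exact patterns are not settled.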
I also need some protection in
|
For sake of testing I have created a truncated file with bash
|
I cannot find a reliable way to tell whether the corrupted file is local or was opened at a remote site via xrootd fallback.
pretty much same as at CERN
|
I had a faint hope that the above was due to running interactively on LPC, where maybe the local storage.json has a different fallback structure, but when running e.g. at Vanderbilt I still get a report which does not mention anywhere that the corrupted file is at CERN!
|
side note: |
there seems to be no clear rule, I just found a truncated file resulting in 8028
|
the discussion about "what to do" also goes on in dmwm/CMSRucio#403 |
an OpenSearch dashboard to look for stdout of jobs failed with 802x: https://monit-opensearch.cern.ch/dashboards/goto/128a23aeb90a87603a043cfeae9e1774?security_tenant=global |
Next step is a script which checks reports in
|
maybe I should rename |
similar topic, now that we parse cmsRun stdout... can something be done about more mundane problems like this, which was caused by a disk pool which went bad? I.e. genuine "local site problems" which need to be fixed by site admins!
I think that the crucial point here is to identify that the first open failure was on a local file, not a remote one. E.g.
But the info does not appear to be there. |
* add check for corrupted files for #7548
* not all FatalRoot are corrupted files. Make sure PJ does not crash
* skip files unknown to Rucio
* retry jobs which report corrupted file
* beware possibly undefined vars
* use run-time switch to control file corruption check
Code which detects corrupted files and reports in
on each condor scheduler. Once we are confident that it is OK, the check will be removed. |
now need to finalize the script which parses summaries and reports to Rucio as per #7548 (comment). Will track in a new GH issue |
There are cases where cmsRun fails w/o creating a FJR with details, e.g. segmentation faults or corrupted input files. In the latter case the details are in the xroot exception which is printed to stdout/err, but the CMSSW framework only sees a generic failure.
It is also possible for CMSSW to exit with an (alas) generic 8021 while stderr has things like
or
The telling string is
Fatal Root Error
which makes this different from "object such and such is not present in this data file".
We should add a step in the wrapper where cmsRun stdout/err is parsed whenever the application fails w/o a FJR and set a proper exit code ourselves. This part could be common with WMA (e.g. part of existing FJR parsing).
May need to define a new exit code for corrupted files.
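A minimal sketch of the proposed wrapper step, assuming the wrapper has cmsRun's stdout/err available as text. The function names are hypothetical, and so is the exit code for corrupted files: none has been defined yet, so it is left as a parameter.

```python
FATAL_ROOT_MARKER = "Fatal Root Error"  # the telling string quoted in this issue


def find_fatal_root_errors(log_text):
    """Return (line_number, stripped_line) pairs for lines containing the marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if FATAL_ROOT_MARKER in line:
            hits.append((number, line.strip()))
    return hits


def classify_cmsrun_failure(exit_code, has_fjr, log_text, corrupted_file_code):
    """Pick the exit code to report for a cmsRun run.

    The log is only inspected when the application failed without a FJR,
    as proposed above. `corrupted_file_code` is a placeholder for the new
    exit code that may need to be defined.
    """
    if exit_code == 0 or has_fjr:
        return exit_code
    if find_fatal_root_errors(log_text):
        return corrupted_file_code
    return exit_code
```

Failures that do not show the marker (e.g. segmentation faults) keep their original exit code, so this step only ever makes the reported code more specific.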