Detect and report corrupted files #7867

Merged

belforte
Member

@belforte belforte commented Sep 8, 2023

A first step toward fixing #7548.
This PR includes code to write a JSON file in /eos/cms/store/temp/user/corrupted/, which we can use to check that things are OK before calling rucioClient.declare_suspicious_file_replicas().

belforte@lxplus806/~> ls /eos/cms/store/temp/user/corrupted/
230908_074255:belforte_crab_20230908_094238.job.5.1.json
230908_074255:belforte_crab_20230908_094238.job.5.2.json
230908_074255:belforte_crab_20230908_094238.job.5.3.json
belforte@lxplus806/~> cat /eos/cms/store/temp/user/corrupted/230908_074255:belforte_crab_20230908_094238.job.5.3.json|jq
{
  "DID": "cms:/store/data/Run2018C/ParkingBPH5/MINIAOD/05May2019-v1/70000/4A628618-FB8A-804C-A905-A046244B6DF3.root",
  "RSE": "T2_US_Wisconsin",
  "exitCode": 8020,
  "message": [
    "== CMSSW:       [c] Fatal Root Error: @SUB=TStorageFactoryFile::Init\n",
    "== CMSSW: file root://cmsxrootd.fnal.gov//store/data/Run2018C/ParkingBPH5/MINIAOD/05May2019-v1/70000/4A628618-FB8A-804C-A905-A046244B6DF3.root is truncated at 2183135232 bytes: should be 4128873606, trying to recover\n"
  ]
}
belforte@lxplus806/~> 

There are also a few simple pylint fixes
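
A minimal sketch of how such a report could be sanity-checked before anything is declared to Rucio. It assumes only the four keys visible in the example above (DID, RSE, exitCode, message) and the exit codes 8020/8021/8028 mentioned later in this conversation. `validate_report` and the individual checks are hypothetical and not the code in this PR; the sketch deliberately stops short of calling Rucio.

```python
import json

# Keys taken from the example report above; the checks themselves are assumptions.
REQUIRED_KEYS = {"DID", "RSE", "exitCode", "message"}
# Exit codes named in this conversation as triggering the corruption check.
CORRUPTION_EXIT_CODES = {8020, 8021, 8028}


def validate_report(text):
    """Return (ok, reason) for one JSON report passed in as a string."""
    try:
        report = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    if not isinstance(report, dict):
        return False, "report is not a JSON object"
    missing = REQUIRED_KEYS - report.keys()
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    did = report["DID"]
    if not isinstance(did, str):
        return False, "DID is not a string"
    # Expect "scope:/store/..." as in the example report.
    _scope, sep, name = did.partition(":")
    if not sep or not name.startswith("/store/"):
        return False, "DID is not of the form scope:/store/..."
    if report["exitCode"] not in CORRUPTION_EXIT_CODES:
        return False, f"unexpected exit code {report['exitCode']}"
    if not report["message"]:
        return False, "no log lines attached"
    return True, "ok"
```

Only reports for which this returns `(True, "ok")` would be candidates for a later declaration step; everything else can be logged with the returned reason and inspected by hand.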

@belforte
Member Author

belforte commented Sep 8, 2023

I need to rebase

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 7 warnings
    • 100 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1689/artifact/artifacts/PullRequestReport.html

@belforte belforte force-pushed the detect-and-report-corrupted-files-fix-7548 branch from e37c9bc to 43d4a34 Compare September 8, 2023 15:22
@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 7 warnings
    • 100 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1690/artifact/artifacts/PullRequestReport.html

@belforte belforte force-pushed the detect-and-report-corrupted-files-fix-7548 branch from 43d4a34 to 6e28a8c Compare September 8, 2023 15:32
@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 7 warnings
    • 100 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1691/artifact/artifacts/PullRequestReport.html

@belforte
Member Author

belforte commented Sep 8, 2023

I tested this locally on crab-dev-tw01.
It should be reasonably safe: pylint fixes aside, the only change is the addition of the new check_corrupted_file() method, which is called when the exit code is 8020, 8021 or 8028.

But a look by a second pair of eyes would surely be good.
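
For illustration, a check of this kind could look roughly like the sketch below. It is not the `check_corrupted_file()` added in this PR: the regular expression is modelled only on the single log line quoted in the PR description, and `find_truncated_file`, its arguments and the default `cms` scope are hypothetical.

```python
import re

# Exit codes that trigger the check, as listed in the comment above.
CORRUPTION_EXIT_CODES = {8020, 8021, 8028}

# Modelled on one example line only (an assumption about the log format):
# "... file root://host//store/....root is truncated at N bytes: should be M, ..."
TRUNCATED = re.compile(
    r"file \S*?(?P<lfn>/store/\S+\.root) is truncated at \d+ bytes: should be \d+"
)


def find_truncated_file(stdout_lines, exit_code, rse, scope="cms"):
    """Return a report dict if the log points at a truncated file, else None."""
    if exit_code not in CORRUPTION_EXIT_CODES:
        return None
    for index, line in enumerate(stdout_lines):
        match = TRUNCATED.search(line)
        if match:
            # Keep the preceding line too, as in the example report.
            context = stdout_lines[max(index - 1, 0): index + 1]
            return {
                "DID": f"{scope}:{match.group('lfn')}",
                "RSE": rse,
                "exitCode": exit_code,
                "message": context,
            }
    return None
```

The returned dictionary has the same four keys as the JSON report shown in the PR description, so it could be passed straight to `json.dump`. Other corruption signatures would need their own patterns.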

Contributor

@novicecpp novicecpp left a comment

Thanks Stefano. Please see the inline comments.

Two inline comments on src/python/TaskWorker/Actions/RetryJob.py (outdated, resolved)

@belforte
Member Author

Good points. I have more fundamental issues now (see #7548), but I will get to these as well.
A first step is indeed to see how many file-open errors are reported as corrupted files.
And yes, the timeout is needed "immediately"!
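
One way to get that first count, sketched under the assumption that the reports are the `*.json` files shown in the PR description and that each carries `RSE` and `exitCode` keys. `tally_reports` is a hypothetical helper, not part of this PR.

```python
import json
from collections import Counter
from pathlib import Path


def tally_reports(directory):
    """Count reports per (RSE, exitCode); skip unreadable or malformed files."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.json")):
        try:
            report = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            continue
        if isinstance(report, dict):
            counts[(report.get("RSE"), report.get("exitCode"))] += 1
    return counts
```

Calling `tally_reports("/eos/cms/store/temp/user/corrupted/").most_common()` would then list the site and exit-code pairs with the most reports first.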

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 8 warnings
    • 99 comments to review
  • Pycodestyle check: succeeded
    • 76 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1692/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 8 warnings
    • 99 comments to review
  • Pycodestyle check: succeeded
    • 77 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1693/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 8 warnings
    • 99 comments to review
  • Pycodestyle check: succeeded
    • 75 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1694/artifact/artifacts/PullRequestReport.html

@belforte
Member Author

I made some changes. With the gfal timeout and this https://github.com/belforte/CRABServer/blob/7be608a0a25fb4149b0ed3ff7a8a68312450d2d1/src/python/TaskWorker/Actions/RetryJob.py#L313-L321
I hope it is robust enough to be put in production so we can see what happens.
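
The thread does not show how the gfal timeout was implemented. As a generic illustration only, an external command can be given a hard time limit with the standard library, so that a hung transfer cannot block the caller. `run_with_timeout` is a hypothetical helper and the command it runs is left to the caller.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run command; return (finished, output). finished is False on timeout."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            text=True,
            timeout=timeout_seconds,
            check=False,  # a non-zero exit status is not treated as an error here
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this exception.
        return False, ""
    return True, result.stdout
```

A caller would treat `finished == False` as a failure to write the report and carry on, rather than letting job retry handling wait on storage.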

@novicecpp maybe you can consider using this in your copy-cat tests, to give it a first shake?

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 8 warnings
    • 99 comments to review
  • Pycodestyle check: succeeded
    • 75 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1695/artifact/artifacts/PullRequestReport.html

Contributor

@novicecpp novicecpp left a comment

LGTM (+an inline suggestion).
Thanks Stefano.

Comment on lines +398 to +402
"""
check if job stdout contains a message indicating a corrupted file and reports this
via a json file taskname.corrupted.job.<crabid>.<retry>.json
returns True/Falso accordingly to corrupted yes/no
"""
Contributor

Suggested change
- """
- check if job stdout contains a message indicating a corrupted file and reports this
- via a json file taskname.corrupted.job.<crabid>.<retry>.json
- returns True/Falso accordingly to corrupted yes/no
- """
+ """
+ check if job stdout contains a message indicating a corrupted file and reports this
+ via a json file taskname.corrupted.job.<crabid>.<retry>.json
+ returns True/Falso accordingly to corrupted yes/no
+ Ref: https://github.com/dmwm/CRABServer/issues/7548
+ """

Contributor

Add a reference for context and for the schema of the logs we are trying to parse.

@novicecpp
Contributor

novicecpp commented Sep 13, 2023

@novicecpp maybe you can consider using this in your copy-cat tests, to give it a first shake?

Sure. I will deploy it tomorrow, in test12.

@cmsdmwmbot

Jenkins results:

  • Python3 Pylint check: succeeded
    • 8 warnings
    • 99 comments to review
  • Pycodestyle check: succeeded
    • 75 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-CRABServer-PR-test/1702/artifact/artifacts/PullRequestReport.html

@belforte belforte merged commit bdc3abe into dmwm:master Sep 18, 2023
@belforte belforte deleted the detect-and-report-corrupted-files-fix-7548 branch September 18, 2023 13:27
3 participants