[DEV] Suspicious replica recoveror: Enrich rucio tracers to include file read errors #691

Open
haozturk opened this issue Dec 19, 2023 · 9 comments

@haozturk
Contributor

haozturk commented Dec 19, 2023

Needed for #403. Traces come from:

 WMArchive: /topic/cms.jobmon.wmarchive 
 CMSSWPOP: /topic/cms.swpop 
 xrootd: /topic/xrootd.cms.aaa.ng

I think we need to talk to the producers of these topics. As far as I know, that is the WMCore team for WMArchive and Bockjoo for xrootd. What about CMSSWPOP? Does CRAB push any data to AMQ? @ericvaandering, any clues?

Context: https://indico.cern.ch/event/1356295/

@haozturk haozturk added the dev development activity label Dec 19, 2023
@haozturk haozturk self-assigned this Dec 19, 2023
@ericvaandering
Member

I'd start with Matti Kortelainen for CMSSW.

@haozturk haozturk added this to the Recover suspicious replicas milestone Dec 19, 2023
@haozturk
Contributor Author

Thanks Eric, I'll contact him. For reference, here are the links to the existing data:

  1. CMSSWPOP
  2. xrootd
  3. WMArchive

@haozturk
Contributor Author

For WMArchive: I see that it already has the error information. See this link for production and this link for CRAB, and look at the data.steps field of a random entry. The only problem is that the field is not indexed, so it is not possible to run queries on it. My plan is to change the rucio-tracers repo so that we parse this info and push it to /topic/cms.rucio.tracer in the right field.

For xrootd and CMSSW: we still don't know how to do it. Bockjoo doesn't know for AAA, and I haven't received a reply from Matti yet. I'll keep investigating.

@makortel

To my understanding the "CMSSW popularity" information originates from CMSSW's StatisticsSenderService, which sends UDP packets to "somewhere". The Service sends a UDP packet with a bunch of information whenever the primary / secondary(=two-file solution) / embedded(=pileup) file is closed. While extending the data sent via UDP would be straightforward (it's JSON after all), adding information on file read errors specifically does not look straightforward. If you really want to, we can take a deeper dive into what the implementation would entail, in which case please open a feature request issue in CMSSW GitHub.

Before committing to any development I'd like to understand why the information in WMArchive (that is filled from the CMSSW framework job reports from both production and CRAB(?)) would not be sufficient. Do you e.g. want to catch the read errors from all the users' non-CRAB jobs as well?
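For illustration only, the "JSON over UDP" reporting described above could look roughly like the sketch below. This is not the StatisticsSenderService implementation: the field names (`file_lfn`, `site_name`, `read_error`), the helper names, and the idea of carrying a read error in the same record are all assumptions made for the example, and the destination host and port are left to the caller.

```python
import json
import socket


def build_packet(lfn, site, read_error=None):
    """Serialize one per-file record as JSON bytes.

    All field names here are hypothetical, not the real CMSSW schema.
    """
    record = {"file_lfn": lfn, "site_name": site}
    if read_error is not None:
        # Hypothetical extra field carrying a short read-error description.
        record["read_error"] = read_error
    return json.dumps(record, sort_keys=True).encode("utf-8")


def send_packet(payload, host, port):
    """Send the payload as a single UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

UDP gives no delivery guarantee, so a sketch like this only suits best-effort monitoring data.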

@haozturk
Contributor Author

Thanks @makortel, this is useful. I agree that we should start with WMArchive.

@yuyiguo I think you're one of the developers of rucio-tracers. At first glance, it seems this task can be accomplished by feeding the errors from the data.steps field in WMArchive into the stateReason field of the rucio traces. I'm looking into how to do this. If you have comments on the subject before I start the implementation, they would be very much appreciated. My only worry is that the errors field can be quite large, and I don't know whether that would cause any issues.
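A minimal sketch of the idea above, assuming each entry in data.steps carries an `errors` list whose items have `exitCode`, `type` and `details` keys (please check this against a real WMArchive record). The function name and the 512-character cap are hypothetical; the cap is only there to address the size worry, not a known limit of the stateReason field.

```python
MAX_REASON_LEN = 512  # hypothetical cap, not a known limit of stateReason


def state_reason_from_steps(steps, max_len=MAX_REASON_LEN):
    """Collapse the errors found in a list of steps into one short string."""
    parts = []
    for step in steps:
        for err in step.get("errors", []):
            # Assumed error keys: "exitCode", "type", "details".
            code = err.get("exitCode", "?")
            kind = err.get("type", "unknown")
            # Collapse newlines and runs of whitespace in the details text.
            details = " ".join(str(err.get("details", "")).split())
            parts.append(f"[{code}] {kind}: {details}")
    reason = "; ".join(parts)
    if len(reason) > max_len:
        # Truncate so the result is exactly max_len characters long.
        reason = reason[: max_len - 3] + "..."
    return reason
```

The returned string could then be placed in the stateReason field of the trace; steps without errors contribute nothing, so a clean job yields an empty string.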

@haozturk
Contributor Author

haozturk commented Mar 6, 2024

Hi @ericvaandering @yuyiguo, how can I test my changes in rucio-tracers? Is there a test queue that I can use to try out my implementation?

Edit: Adding in @dynamic-entropy as well in case he knows

@dynamic-entropy
Contributor

I never looked at this, so I cannot give an exact answer. But you can subscribe to the same queue with a different client, and you will receive the same events without affecting prod.
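As a rough illustration of what a second, independent subscription amounts to at the protocol level, the sketch below builds a STOMP SUBSCRIBE frame by hand (command line, headers, blank line, NUL terminator, as laid out in the STOMP specification). The helper names and the subscription id are hypothetical, the destination is just one of the topics listed in this issue, and header-value escaping is deliberately not handled. A real test client would normally use an existing STOMP library and the broker credentials rather than raw frames.

```python
def stomp_frame(command, headers, body=""):
    """Build a raw STOMP frame: command, headers, blank line, body, NUL."""
    lines = [command] + [f"{key}:{value}" for key, value in headers.items()]
    return ("\n".join(lines) + "\n\n" + body + "\x00").encode("utf-8")


def subscribe_frame(destination, subscription_id):
    """SUBSCRIBE frame for a separate test client (the id is hypothetical)."""
    headers = {
        "id": subscription_id,
        "destination": destination,
        "ack": "auto",
    }
    return stomp_frame("SUBSCRIBE", headers)


# Example: a test subscription to one of the topics named in this issue.
frame = subscribe_frame("/topic/cms.jobmon.wmarchive", "test-sub-1")
```

On a topic destination each subscriber is expected to get its own copy of every message, which is why a separate test client should not take events away from production.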

@haozturk
Contributor Author

haozturk commented Mar 6, 2024

Thanks Rahul. We had a chat with Rahul and Nikodemas offline, and we'll request a new subscriber for this queue to be used for testing. If anybody already has a test subscriber for this queue, please let me know so that we can avoid duplicating work.

@belforte
Member

belforte commented May 2, 2024

I should have read this issue earlier...

  • We intend to decommission WMArchive for CRAB, and we have not been sending data there for a few weeks.
  • My understanding is that it is difficult/impossible for CMSSW to send info about root fatal errors, since the info is in the xroot exception, which is not parsed by the framework.
  • I would be happy to make CRAB send a UDP packet, just like CMSSW does, when we discover a hint of a corrupted file in the logs. Can someone tell me how to do it and what the format should be?
  • Or CRAB can send directly to the queue mentioned above. In this case it would be better to get @mapellidario involved, since he already knows about STOMP (if STOMP is needed here).
  • Whichever means we use, I think we should send one (short) message per file and let Rucio decide how many reports to get before taking action.
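
The last point above (one short message per file, with the consumer deciding how many reports are enough) could be sketched as follows. This is only an illustration of the counting idea, not how Rucio handles suspicious replicas: the class name, the `(lfn, site)` key and the default threshold of 3 are all hypothetical.

```python
from collections import Counter


class CorruptionReports:
    """Count per-file corruption reports and flag files past a threshold."""

    def __init__(self, threshold=3):
        # Hypothetical default: act after three independent reports.
        self.threshold = threshold
        self.counts = Counter()

    def add(self, lfn, site):
        """Record one report; return True once the threshold is reached."""
        key = (lfn, site)
        self.counts[key] += 1
        return self.counts[key] >= self.threshold

    def suspicious(self):
        """Return the sorted (lfn, site) pairs at or above the threshold."""
        return sorted(k for k, n in self.counts.items() if n >= self.threshold)
```

Keeping the sender side this simple means the decision policy can change on the consumer side without touching CRAB.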
