[DEV] Suspicious replica recoveror: Configure conveyor-finisher such that it starts marking replicas suspicious #692
Comments
I've added the suspicious pattern to the finisher config in the integration:
I'm now working on producing relevant transfer errors to see whether the finisher is able to mark them as suspicious.
I produced some stuck rules with checksum problems in integration. The finisher is able to spot them, but it is not able to mark them as suspicious.
Problem seems to be in the
Okay, I confirm that the reason why conveyor-finisher cannot mark a replica suspicious in the bad_replicas table is this ATLAS-specific code [1]. In particular, the DID name is computed incorrectly. For instance, the replicas variable gets the following value
for the following pfn
where the
@ericvaandering @voetberg you mentioned that this is already known. Should we handle this as a third step in [2], or is there already a ticket open for this in Rucio?
[1] https://github.com/rucio/rucio/blob/e819e3c001d8613076728ffd8f424fa3f2bdc812/lib/rucio/core/replica.py#L273
@haozturk Yep, this is already known and there's an open PR with a solution: issue rucio/rucio#6363 and PR rucio/rucio#6444. It's based on VOs instead of config settings but should have similar results.
Ah, thanks a lot @voetberg. I'll keep this issue in
Hi @voetberg, I patched the integration cluster with your PR 6444 for testing, and I see that we still can't mark replicas suspicious. This is the relevant snippet from the finisher logs:
and this shows that we couldn't mark them suspicious:
rucio-int is an alias to
Can you please help me understand what the issue is?
6444 only covers correctly parsing the scope and name in "declare bad replicas". Did you also patch with rucio/rucio#6764?
6764 shouldn't be necessary to mark replicas suspicious. The suspicious replica recoveror daemon doesn't mark replicas suspicious. It processes the replicas that are already marked suspicious and declares them bad when necessary. I'm testing the parts that mark replicas suspicious, which happens in two places in Rucio: kronos and conveyor-finisher. In the context of this issue, I'm testing the latter, and this line [1] simply doesn't work.
[1] https://github.com/rucio/rucio/blob/master/lib/rucio/daemons/conveyor/finisher.py#L373
Ah okay, I misunderstood you there. I was looking for the obvious solution. Then I'm not immediately sure; I'm as stumped as you are. Can you verify that the test RSE is deterministic? Otherwise I can jump in and start adding logging statements everywhere to see where this gets stuck.
Thanks Maggie, yes it's deterministic:
Adding more logging would indeed help. The current logs don't help much. I couldn't figure out an easy way to add logging lines. We probably need to add new log lines, make a temporary patch, deploy it, and watch what's going on.
Hi @voetberg I've found what's wrong.
it yields
whereas it should've been
which is computed here: https://github.com/rucio/rucio/blob/cf9c2c7afad37fc5b7bdd782e281918b4230b2f1/lib/rucio/core/replica.py#L275
As far as I'm aware
Does it sound reasonable? Do you want me to fix this upstream?
@haozturk The way the code is set up, you may be able to write a plugin that just returns "cms" for the scope for any PFN. Right now it's written so that it uses the VO to determine which code extracts the scope from the PFN. I think it would be as simple as patching in
Thanks @voetberg, indeed this would be the most convenient. However, I'm a bit worried about the exceptions where the scope isn't
Lastly, we need to change the way we extract the name/LFN, but that should be easy.
Igor did something similar here: rucio/rucio#6350
We fixed the relevant issues in upstream Rucio. This is done and deployed.
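For illustration only, here is a minimal sketch of the idea discussed above: return a fixed scope for any PFN and derive the name from the PFN path. Everything in it is an assumption rather than Rucio's actual plugin interface: the function name, the `lfn_marker` of `/store/`, the `scope_overrides` mapping for exceptional scopes, and the example hostname are all hypothetical.

```python
from urllib.parse import urlparse


def scope_and_name_from_pfn(pfn, lfn_marker="/store/", default_scope="cms",
                            scope_overrides=None):
    """Hypothetical helper: derive (scope, name) from a PFN.

    Assumes the logical file name starts at the first occurrence of
    `lfn_marker` in the PFN path. `scope_overrides` maps name prefixes to
    scopes for cases where the default scope does not apply.
    """
    path = urlparse(pfn).path
    start = path.find(lfn_marker)
    if start < 0:
        raise ValueError(f"cannot locate {lfn_marker!r} in PFN path: {path}")
    name = path[start:]

    # Check caller-supplied exceptions before falling back to the default.
    for prefix, scope in (scope_overrides or {}).items():
        if name.startswith(prefix):
            return scope, name
    return default_scope, name


# Made-up PFN on a made-up host, only to show the call shape.
example_pfn = "davs://host.example:1094/pnfs/example/store/data/file.root"
scope, name = scope_and_name_from_pfn(example_pfn)
```

Keeping the exceptions in a caller-supplied mapping leaves the default case trivial while still allowing per-prefix scopes to be added later.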
Context: https://indico.cern.ch/event/1356295/
Needed for #403
I suggest putting the following patterns in the config in integration and watching how it goes:
Even if we do it in production, it should do nothing but mark some replicas suspicious depending on transfer errors, which has no effect as long as the suspicious-replica-recoveror daemon is not running. But let's not do anything in production for now, especially before the Christmas break.
@ericvaandering @dynamic-entropy any thoughts?
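As a rough illustration of how configured patterns could be applied to transfer errors, the sketch below assumes the config option holds a comma-separated list of regular expressions and that each one is matched from the start of the error message. The helper names, the option format, and the sample error strings are assumptions made for this example, not the finisher's actual code.

```python
import re


def compile_patterns(option_value):
    """Split a comma-separated option value into compiled regexes."""
    return [re.compile(part.strip())
            for part in option_value.split(",")
            if part.strip()]


def is_suspicious(error_message, patterns):
    """Return True if any pattern matches from the start of the message.

    re.match anchors at the beginning of the string, so a pattern that
    should match anywhere needs a leading '.*'.
    """
    return any(p.match(error_message) for p in patterns)


# Illustrative patterns and messages, not taken from a real config or log.
patterns = compile_patterns(".*checksum.*, .*No such file.*")
flagged = is_suspicious("TRANSFER ERROR: checksum mismatch", patterns)
ignored = is_suspicious("connection timed out", patterns)
```

With anchored matching, a pattern such as `checksum` alone would not match a message that merely contains the word, which is worth checking when a configured pattern appears to have no effect.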