cleaner-hsm: `info` command needs more info, and better logging #7581
Comments
Apparently our cleaner-hsm was a bit confused. I restarted it, and now at least it seems to log to which node it is sending runs.
Not sure though why the
About the state of the cleaner: could the following lines (from before the restart) shed any light on it having a huge backlog?
Hi Onno.
Hi Lea, Here is our current config. I included the cleaner-disk settings.
We've done considerable tuning over the last few days to try to get rid of the cleaner-hsm backlog. After the restart, the cleaner-hsm reverted to these settings. I'm not sure what you mean by: "but this does not make sense for cleaner-hsm".
I just noticed this log entry:
Why are there two different pool names in this entry? Is one pool sending or forwarding work to another?
Motivation:

Modification:
To cleaner-hsm's admin `info` command, add information about which pools are currently waited for. These pools are instructed to delete files from an attached HSM and should reply with success or failure once done. Additionally, change pool selection behaviour to only select one pool for hsm-cleaning at a time.

Result:
More information for admins: waited-for pools are now shown in cleaner-hsm's `info` command.

Target: master
Requires-notes: no
Requires-book: no
Addresses: #7581
Acked-by: Tigran Mkrtchyan
Patch: https://rb.dcache.org/r/14270/
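For readers who want a concrete picture of the change described in this commit message, here is a minimal sketch of tracking waited-for pools and rendering them as `info`-style text. It is not the dCache patch: the class `WaitedForPools`, its methods, and the output format are all hypothetical, and the selection rule (return a single pool per call, skipping pools that still owe a reply) is only one possible reading of "select one pool at a time".

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not dCache code: remembers which pools have been asked
// to delete files from their HSM and have not replied yet.
class WaitedForPools {
    // pool name -> time the delete request was sent (insertion order kept)
    private final Map<String, Instant> waiting = new LinkedHashMap<>();

    /** Returns at most one pool per call, skipping pools still waited for. */
    synchronized Optional<String> selectNext(Iterable<String> candidates, Instant now) {
        for (String pool : candidates) {
            if (!waiting.containsKey(pool)) {
                waiting.put(pool, now);
                return Optional.of(pool);
            }
        }
        return Optional.empty();
    }

    /** To be called when a pool replies, whether with success or failure. */
    synchronized void replied(String pool) {
        waiting.remove(pool);
    }

    /** Text that an info-style command could append to its output. */
    synchronized String info(Instant now) {
        StringBuilder sb = new StringBuilder("Pools waited for: ");
        if (waiting.isEmpty()) {
            return sb.append("none\n").toString();
        }
        sb.append(waiting.size()).append('\n');
        for (Map.Entry<String, Instant> e : waiting.entrySet()) {
            long seconds = Duration.between(e.getValue(), now).getSeconds();
            sb.append("  ").append(e.getKey())
              .append(" (waiting ").append(seconds).append(" s)\n");
        }
        return sb.toString();
    }
}
```

Keeping the send time next to the pool name lets the output show how long a reply has been outstanding, which is the kind of detail that helps when diagnosing timeouts between cleaner and pool.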
Dear dCache devs,
I'm currently troubleshooting `cleaner-hsm` congestion (version 9.2.18). A challenge I'm facing is that I can't see which runs the cleaner has submitted to which pools, and which of these runs it is still expecting a reply from.
We have seen that our HSM script did too many retries for removals, and this caused timeouts between cleaner-hsm and pool. Since the cleaner will do retries anyway, we have now disabled retries for removals in our HSM script. Still, things are not going as smoothly as I had hoped. It is currently very difficult to see:
It would be very helpful if such information were shown by the `info` command.
As for logging: with logging set to debug, new jobs are shown as "New run...", without any useful information. Only when a job finishes is there some relevant info:
It would be helpful if the "New run" log entry included more details, especially to which node the run is submitted (so that I may monitor it there).
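To make the request concrete, a "New run" entry carrying the target pool could look roughly like the output of the sketch below. This is not dCache code: the class `RunLogMessages`, its parameters, and the message layout are purely hypothetical.

```java
import java.util.List;

// Hypothetical formatter, not dCache code: builds a "New run" log line that
// names the HSM instance, the pool the run is sent to, and the batch size.
class RunLogMessages {
    static String newRun(String hsm, String pool, List<String> fileIds) {
        // Show the first file ID as a handle for grepping the pool's own log.
        String first = fileIds.isEmpty() ? "-" : fileIds.get(0);
        return String.format("New run: hsm=%s pool=%s files=%d first=%s",
              hsm, pool, fileIds.size(), first);
    }
}
```

Including the pool name in the entry is what would make it possible to follow a run on the node that executes it.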
Additionally, there is this setting in the cleaner-hsm:
dcache/skel/share/defaults/cleaner-hsm.properties
Line 64 in 0597b1f
It's currently not clear (to me at least) what effect this setting has. Is it the maximum number of concurrent runs? It might help if the `info` command showed what these threads were doing.
If someone were to pick this up, it might be efficient to apply similar changes to the `cleaner-disk`.
Thanks!