How many times do we retry a single job, on average #141
Comments
Jen, monitoring does not appear by magic; somebody first needs to write such information somewhere. In order to answer your request, please provide more information about where this information is stored and who is feeding it to which system. Once we understand how the information is fed to some system, it will be clear how to answer your question. I also doubt this repo is the right place to request this info. It is the repository of tools the CMS Monitoring group develops, and your request is very vague as well. Do you need a CLI tool, a plot, or a dashboard? What would be the input parameters for such a tool/plot/dashboard? How will you interact with it, via terminal or web GUI? I suggest that you restructure your request as follows:
or maybe you want to pass a pattern, etc.
Once we have a clear picture, we can discuss how to implement it.
Of course it doesn't appear by magic, otherwise I would have made it appear. Within WMStats I can click on "State Transition" and see this: [screenshot: State Transition]
Depending on the exit code, the maximum number of retries is hard coded. I believe the number of retries is stored in WMArchive. What we would like to be able to cross-reference (plotted) is:
Jen, WMStats is not part of CMS Monitoring and uses an internal CouchDB backend which is not integrated with CMS Monitoring. Therefore, in order to use this information, someone needs to either propagate it to MONIT or provide APIs which we can use to fetch it. In both cases @amaltaro can answer whether it is possible. I am also not sure that WMArchive really contains this info. Looking at the WMArchive schema, I don't see any retry info stored in WMArchive documents. Again, @amaltaro can tell us more. Therefore, even if you clarify a few bits, it is far from clear that we have any information about this in the CMS Monitoring landscape. The information is kept within DMWM internal tools, and in order to use it we need a broader discussion of how to access it (queries, periodicity, etc.) and how to aggregate and store it before visualizing it. I think the information belongs to DMWM and DMWM should push it to MONIT, e.g. we have a proxy server for that. Once the information is in MONIT, e.g. in ES, it will be easier to visualize. If instead we keep the information within WMStats, we need to adapt some API to extract/query it from that system, and I'm not aware that such APIs exist.
Hi @jenimal,
I created a separate issue, dmwm/WMCore#11140, where we should decide on the approach and on which information we need in CMS Monitoring. Once the information is there, we can discuss here how to yield it back to the users and which tools/plots/dashboards we can provide.
Thanks @vkuznet
WMCore retries jobs multiple times to take care of temporary issues within the system.
I can't find any monitoring that tells me on average how many times a job gets resubmitted. The vast majority of jobs obviously succeed in the end but are they succeeding on the 1st try? The 3rd try?
The ability to break this down by site and exit code might also be useful, so we can monitor which sites are causing more retries than others and for what reasons. This would help us better understand the effectiveness of each site; if a particular site sticks out as always needing many retries, it may have an undetected issue that is affecting efficiency.
Jen
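To make the request concrete, the aggregation asked for above could be sketched roughly as follows. This is a minimal illustration only: the record layout (`site`, `retry_exit_code`, `retries`), the site names, and the exit codes are all hypothetical and do not reflect any actual WMArchive, WMStats, or MONIT schema.

```python
from collections import defaultdict

# Hypothetical per-job records; field names and values are assumptions
# made for illustration, not a real WMArchive/WMStats/MONIT schema.
# retry_exit_code is the exit code that triggered the retries (None if none).
jobs = [
    {"site": "SITE_A", "retry_exit_code": None, "retries": 0},
    {"site": "SITE_A", "retry_exit_code": None, "retries": 0},
    {"site": "SITE_A", "retry_exit_code": 99, "retries": 2},
    {"site": "SITE_B", "retry_exit_code": 99, "retries": 3},
    {"site": "SITE_B", "retry_exit_code": 50, "retries": 1},
]


def mean_retries(records, keys):
    """Return {group_tuple: average retries}, grouping records by `keys`."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of retries, job count]
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group][0] += rec["retries"]
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def first_try_fraction(records):
    """Return the fraction of jobs that needed no retry at all."""
    if not records:
        return 0.0
    return sum(1 for rec in records if rec["retries"] == 0) / len(records)


overall = mean_retries(jobs, [])  # single group keyed by the empty tuple
by_site = mean_retries(jobs, ["site"])
by_site_and_code = mean_retries(jobs, ["site", "retry_exit_code"])

print("overall average retries:", overall[()])
print("average retries per site:", by_site)
print("average retries per (site, exit code):", by_site_and_code)
print("fraction succeeding on first try:", first_try_fraction(jobs))
```

Whether such per-job retry counts are actually available anywhere is exactly the open question in this thread; the sketch only shows what the requested numbers would be once that data exists.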