Provide workflow/job information to MONIT #11140
Comments
@amaltaro , @todor-ivanov , @klannon , @dciangot this issue requires coordination with the data-ops group to figure out which information is required for their monitoring needs. Once this is clear, we can discuss how to fetch it from WMStats. |
Without having really thought about this, we might be able to accommodate most of it with the data already uploaded to WMArchive. |
@amaltaro just wondering if there is an update on how we can get the data for tracking. |
I realize this isn't high on your priority list but Christoph asks about it weekly at the L2/3 meeting so if we can get some sort of feedback that would be great. |
Before we start working on yet another monitoring system, I think we should evaluate what is already available from the current monitoring systems:
Once we know what is missing and which kind of information we need, we can discuss how to make it available. In short, I am pretty sure that CMS Job Monitoring has the job retry information. Valentin, can you please check whether job retry information is available in the data uploaded from the agents? |
@amaltaro , we already provided the information you asked for, see dmwm/CMSMonitoring#141
And Jen asked for specific information which I doubt is present in either WMArchive or the Condor data, since it is related to WMAgent activity on a certain site. In other words, it is WMAgent (intermediate) information which we do not store in either of the above sources. |
Thanks for these pointers, Valentin. From the WMArchive schema that you provided, indeed it seems we do not have the retry number information in the document uploaded to WMArchive. I think we should modify this schema and make sure the job retry number is there. Regarding CMS Job Monitoring (index monit_prod_condor_raw_metric*), one can build a visualization using (
I'd suggest having someone (Jen, or the monitoring team) try to build some visualization in Kibana/Grafana. |
Hi @jenimal @amaltaro [1] https://monit-grafana.cern.ch/d/6BfIQzC7z/wmcore-job-retry-count?orgId=11 |
I'm trying to interpret exactly what I am seeing in the plot. |
Jen, this is just an example. The plot shows the sum of jobRetry counts. We can display the average, or the max. Just give us the exact requirements of the plot you want.
The aggregation is done at the HTCondor job level.
On 7 Jun 2022, at 17:19, Jen Adelman-McCarthy wrote:
I'm trying to interpret exactly what I am seeing in the plot.
Why are we seeing 2255 retries on a particular job for the campaign RunIISummer20UL18GEN? I'm guessing that is the retries overall for a workflow, not for a particular job, so I don't think we are getting what we are actually looking for in this plot.
|
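To illustrate why a summed panel can show a number as large as 2255 for one campaign, here is a minimal sketch comparing the sum, mean, and max aggregations over per-job retry counts. The records and the `Campaign` field name are made up for illustration; only `CMS_JobRetryCount` and `GlobalJobId` are field names mentioned in this thread, and this is not the dashboard's actual query.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job documents. "Campaign" is an assumed field name.
records = [
    {"GlobalJobId": "a#1", "Campaign": "CampaignA", "CMS_JobRetryCount": 0},
    {"GlobalJobId": "a#2", "Campaign": "CampaignA", "CMS_JobRetryCount": 2},
    {"GlobalJobId": "a#3", "Campaign": "CampaignA", "CMS_JobRetryCount": 1},
    {"GlobalJobId": "b#1", "Campaign": "CampaignB", "CMS_JobRetryCount": 3},
]


def summarize_retries(docs, group_key="Campaign"):
    """Aggregate per-job retry counts for each value of group_key."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc[group_key]].append(doc["CMS_JobRetryCount"])
    return {
        key: {
            "sum": sum(values),    # grows with the number of jobs in the group
            "mean": mean(values),  # typical retries per job
            "max": max(values),    # worst single job
            "jobs": len(values),
        }
        for key, values in grouped.items()
    }


summary = summarize_retries(records)
```

The sum scales with how many jobs fall into the group, so it describes the workflow or campaign as a whole, while the mean and max stay on the scale of a single job.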
I think the main question here is whether ops need information per workflow or per job. If the former, the condor data can be used; if the latter, then I think we need WMStats info. |
Hi @jenimal , I created the plot according to my assumptions. It simply sums the retries for each group-by value in the binning that you can select in the dashboard. To improve the plot, could you please define what kind of graph we are trying to see using the current source [2]? Maybe we can use the "WMAgent_JobID" field for individual job retries, but that needs confirmation. In short, I need more input from you to create the plot. You can see the fields and their values in Kibana [2]. [2] https://monit-kibana.cern.ch/kibana/goto/4cbc8ab9235b687163eb6287bb129c4d |
I'll leave it to Jen to provide the requirements for this visualization. @mrceyhun However, from my side, I would say that having a histogram (or pie chart) with the distribution of jobs with exit code == 0, grouped by their retry number, would be extremely helpful. I can't think of a good way to make it a time series, but you might. A further improvement would be the ability to classify it into one of these:
|
Alan has good suggestions above. |
Hi @jenimal @amaltaro All panels count unique "GlobalJobId". What do you think? [3] https://monit-grafana.cern.ch/d/6BfIQzC7z/wmcore-job-retry-count?orgId=11 |
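As a rough sketch of the kind of histogram discussed above (successful jobs grouped by retry number, counting each "GlobalJobId" once), something like the following could be run over exported documents. The `ExitCode` field name and the sample values are assumptions for illustration, and keeping only the first document per job is a simplification; this is not the query behind the dashboard.

```python
from collections import Counter

# Hypothetical exported documents. "ExitCode" is an assumed field name.
jobs = [
    {"GlobalJobId": "j1", "ExitCode": 0, "CMS_JobRetryCount": 0},
    {"GlobalJobId": "j2", "ExitCode": 0, "CMS_JobRetryCount": 0},
    {"GlobalJobId": "j3", "ExitCode": 0, "CMS_JobRetryCount": 1},
    {"GlobalJobId": "j4", "ExitCode": 8001, "CMS_JobRetryCount": 2},
    {"GlobalJobId": "j3", "ExitCode": 0, "CMS_JobRetryCount": 1},  # duplicate document
    {"GlobalJobId": "j5", "ExitCode": 0, "CMS_JobRetryCount": 3},
]


def retry_distribution(docs):
    """Count successful jobs per retry number, one entry per GlobalJobId."""
    seen = set()
    counts = Counter()
    for doc in docs:
        job_id = doc["GlobalJobId"]
        if job_id in seen:
            continue  # only the first document seen for a job is used
        seen.add(job_id)
        if doc["ExitCode"] == 0:
            counts[doc["CMS_JobRetryCount"]] += 1
    return dict(sorted(counts.items()))


distribution = retry_distribution(jobs)
```

Each key is a retry number and each value is the number of distinct successful jobs that finished with that many retries, which maps directly onto histogram bars or pie slices.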
I think we are definitely getting somewhere with this latest plot! Let me run it by Christoph and see what he thinks. |
@mrceyhun I have two observations on this dashboard:
|
@jenimal @leggerf Jen, Federica, we are trying to organize the issues to be planned for Q4/2023, and this has been ranked as the most important ticket under the "Monitoring" requests (other than campaign and task type, which we are likely still delivering in Q3). I don't really know if this question has to be directed to the Monitoring or to the P&R team, so here goes my attempt. If there is still work to be done, is it supposed to be on the WMCore side (like lack of information in the monitoring database)? Or is it supposed to be on the Monitoring side, to build dashboards based on data already available? If it is on WMCore, then could you please specify clear requirements? What exactly are we lacking in monitoring that you need to have? Thanks! |
Not sure if there is anything else needed on our side. Ceyhun provided a dashboard. I'm adding @nikodemas so that he is also aware of this thread. I think it is up to @jenimal to decide whether the information is sufficient or not. |
Given the lack of activity in this ticket, I am going to close this out. However, if there is a reason to keep it open, please do so. If it's about more/new/better monitoring information, we might want to stick with a fresh new ticket as well. Thank you Jen, Federica et al. |
fine for me |
Impact of the new feature
The data-ops group would like to have a better understanding of job management.
Is your feature request related to a problem? Please describe.
As described in dmwm/CMSMonitoring#141 Jen wants to see the following:
Describe the solution you'd like
There are two solutions to this problem:
Describe alternatives you've considered
At the moment, we have CMS_JobRetryCount information in ES condor data, see here, but we are not sure if it is sufficient for the requested use case.
Additional context
We may review WMStats information and its applicability for CMS Monitoring, and have a broader discussion with the CMS Monitoring and data-ops groups.
@mrceyhun , @leggerf , @brij01 , @jenimal