Provide workflow/job information to MONIT #11140

vkuznet · 2022-05-11T15:01:51Z

Impact of the new feature
The data-ops team would like to have a better understanding of job management.

Is your feature request related to a problem? Please describe.
As described in dmwm/CMSMonitoring#141 Jen wants to see the following:

- A simple number of retries for each job. I suspect this number is 0/1, but Christoph wants to know this, as the only ones we ever look at in the end are the failed jobs.
- Exit code/site # retries, as this could tell us if there is a site issue that we are missing.
- Exit code/retry/workflow and/or campaign or dataset, because sometimes, if it is a file read issue and we can spot a spike of retries that isn't settling down as a workflow progresses, we may need to make more replicas.

Describe the solution you'd like
There are two solutions to this problem:

- Feed WMStats information to the CMS MONIT infrastructure, a.k.a. the push approach
  - in this approach we need to decide which information to push to the CMS MONIT infrastructure and at which interval
  - it will require a decision on how the information should be pushed, e.g. directly to the CERN AMQ brokers (code would have to be developed for that), or via the CMSAMQProxy server (in which case HTTP requests are sufficient)
  - we'll need to decide on the data format and schema
- Add APIs to WMStats (or any other service) which can provide this information to an upstream caller, i.e. the pull approach
  - in this scenario, we need to convert WMStats from a JavaScript application within CouchDB into a real HTTP service with (RESTful) APIs
  - we'll need to decide on the data format and API metrics
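
To make the data-format question more concrete, here is a minimal sketch of what one pushed document could look like. Nothing here is an agreed schema: the `JobRetryRecord` class, its field names, and the sample values are all hypothetical, and the sketch only builds the JSON body without sending it anywhere.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class JobRetryRecord:
    """Hypothetical per-job record; none of these field names is an agreed schema."""
    workflow: str
    job_id: str
    site: str
    exit_code: int
    retry_count: int
    timestamp: int  # seconds since the epoch


def to_payload(records):
    """Serialize a batch of records into one JSON array (e.g. the body of one HTTP request)."""
    return json.dumps([asdict(r) for r in records], sort_keys=True)


# Made-up example values, for illustration only.
record = JobRetryRecord(
    workflow="example_workflow",
    job_id="job-1",
    site="T2_US_Wisconsin",
    exit_code=0,
    retry_count=1,
    timestamp=1652281311,
)
payload = to_payload([record])
```

Batching several records into one array is just one possible choice; the push interval and the transport (AMQ brokers or an HTTP proxy) would still need to be decided as listed above.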

Describe alternatives you've considered
At the moment, we have the CMS_JobRetryCount information in the ES condor data, see here, but we are not sure if it is sufficient for the requested use case.

Additional context
We may review the WMStats information and its applicability for CMS Monitoring, and have a broader discussion with the CMS Monitoring and data-ops groups.

@mrceyhun , @leggerf , @brij01 , @jenimal

vkuznet · 2022-05-11T15:05:40Z

@amaltaro , @todor-ivanov , @klannon , @dciangot this issue requires coordination with the data-ops group to figure out which information is required for their monitoring needs. Once this is clear, we can discuss how to fetch it from WMStats.

amaltaro · 2022-05-11T16:03:23Z

Without really thinking about this, we might be able to accommodate most of this with the data already uploaded to WMArchive.

jenimal · 2022-05-19T14:51:39Z

@amaltaro just wondering if there is an update on how we can get the data for tracking.

jenimal · 2022-06-06T19:14:38Z

I realize this isn't high on your priority list but Christoph asks about it weekly at the L2/3 meeting so if we can get some sort of feedback that would be great.

amaltaro · 2022-06-06T20:06:09Z

Before we start working on yet another monitoring system, I think we should evaluate what is already available from the current monitoring systems:

- WMArchive
- Job Monitoring in Kibana

Once we know what is missing and which kind of information we need, then we can discuss how to make that available.

In short, I am pretty sure that CMS Job Monitoring has the job retry. Valentin, can you please check whether job retry information is available in the information uploaded from the agents?

vkuznet · 2022-06-06T21:17:35Z

@amaltaro , we already provided information you asked for, see dmwm/CMSMonitoring#141

- The WMArchive schema is here
- JobMonitoring data comes from condor. In Kibana it corresponds to the monit_prod_condor_raw_metric index and you can view a single document here

Also, Jen asked for specific information which I doubt is present in either the WMArchive or the Condor data, since it is related to WMAgent activity on a certain site. In other words, it is WMAgent (intermediate) information which we do not store in either of the above sources.

amaltaro · 2022-06-07T03:45:11Z

Thanks for these pointers, Valentin.

From the WMArchive schema that you provided, indeed it seems we do not have the retry number information in the document uploaded to WMArchive. I think we should modify this schema and make sure the job retry number is there.

Regarding CMS Job Monitoring (index monit_prod_condor_raw_metric*), one can build a visualization using (data object):

"CMS_WMTool": "WMAgent",
"CMS_JobType": "Production",  # to classify job groups
"CMS_JobRetryCount": 0,
"ExitStatus": 0,  # this gives the exit code for the job as a whole
"Chirp_WMCore_cmsRun_ExitCode": 0,  # this gives the exit code for the cmsRun status
"GLIDEIN_CMSSite": "T2_US_Wisconsin",  # or one of the variations that report the site name where the job ran

I'd suggest having someone (Jen, or the monitoring team) try to build some visualization in Kibana/Grafana.
Regardless of whether that is enough or not, I also think we should investigate the feasibility of providing the job retry number to WMArchive as well, though this is likely to be considered only for the next quarter.

mrceyhun · 2022-06-07T12:09:46Z

Hi @jenimal @amaltaro
I created [1], which sums the CMS_JobRetryCount ClassAd values grouped by the Group By term. I have already put in variables and filters; you can give it a try and ask for further suggestions. Hope it helps.

[1] https://monit-grafana.cern.ch/d/6BfIQzC7z/wmcore-job-retry-count?orgId=11

jenimal · 2022-06-07T15:19:22Z

I'm trying to interpret exactly what I am seeing in the plot.
Why are we having 2255 retries on a particular job for the campaign RunIISummer20UL18GEN? I'm guessing that is retries overall for a workflow, not on a particular job, so I don't think we are getting what we are actually looking for in this plot.

leggerf · 2022-06-07T15:23:47Z

Jen, this is just an example. The plot shows the sum of jobRetry counts. We can display the average, or the max. Just give us the exact requirements of the plot you want. The aggregation is done at the HTCondor job level.
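
To illustrate why the choice of aggregation matters, here is a small sketch with made-up per-job retry counts for one workflow. The numbers are invented for illustration and are not taken from the dashboard.

```python
from statistics import mean

# Made-up retry counts, one entry per HTCondor job of a single workflow.
retry_counts = [0, 0, 1, 3, 2]

summary = {
    "sum": sum(retry_counts),    # grows with the number of jobs in the group
    "mean": mean(retry_counts),  # typical number of retries per job
    "max": max(retry_counts),    # worst single job
}
```

With many jobs in a group, the sum can reach the thousands even when no individual job was retried more than a few times, which is why the average or the maximum may be closer to a per-job view.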

vkuznet · 2022-06-07T15:32:41Z

I think the main question here is whether ops need the information per workflow or per job. If the former, the condor data can be used; if the latter, then I think we need the WMStats info.

mrceyhun · 2022-06-07T16:33:05Z

Hi @jenimal , I created the plot according to my assumptions. It just sums the retries for the group-by value in the defined binning, which you can select in the dashboard. To improve the plot, could you please define what kind of graph we're trying to see using the current source[2]?

Maybe we can use the "WMAgent_JobID" field for individual job retries, but this needs confirmation. In short, I need more input from you to create the plot. You can see the fields and their values in Kibana[2].

[2] https://monit-kibana.cern.ch/kibana/goto/4cbc8ab9235b687163eb6287bb129c4d

amaltaro · 2022-06-07T17:12:07Z

I leave Jen to provide the requirements for this visualization.

@mrceyhun However, from my side, I would say that having a histogram (or pie-chart) with the distribution of jobs with exit code == 0 grouped by their retry number would be extremely helpful. I can't think of a good way to make it a time series, but you might.

An improvement for such information would be to be able to classify it in one of these:

- job type (CMS_JobType classad, AFAIR)
- workflow name, or task name, or prep_id (unsure which one is more meaningful)
- site name
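
A minimal sketch of the distribution described above, assuming job documents are available as plain dicts with the CMS_JobRetryCount and ExitStatus fields from the condor data. The helper name `success_share_by_retry` and the sample data are hypothetical.

```python
from collections import Counter


def success_share_by_retry(docs):
    """Return the fraction of successful jobs (ExitStatus == 0) per retry count."""
    counts = Counter(
        doc["CMS_JobRetryCount"] for doc in docs if doc.get("ExitStatus") == 0
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {retry: n / total for retry, n in sorted(counts.items())}


# Made-up documents: four successful jobs and one failed job.
sample = [
    {"CMS_JobRetryCount": 0, "ExitStatus": 0},
    {"CMS_JobRetryCount": 0, "ExitStatus": 0},
    {"CMS_JobRetryCount": 1, "ExitStatus": 0},
    {"CMS_JobRetryCount": 3, "ExitStatus": 0},
    {"CMS_JobRetryCount": 2, "ExitStatus": 1},
]
shares = success_share_by_retry(sample)
```

Each value in `shares` corresponds to one slice of the pie chart; classifying by job type, workflow, or site would amount to running the same count once per group.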

jenimal · 2022-06-07T19:05:44Z

Alan has good suggestions above.
I think a pie chart would be a good way to easily see how many jobs succeed after each try.
Another way to classify info that would be valuable would be input dataset name. This could help us to identify which datasets are being heavily read, and jobs failing because the files are "busy" and having to retry.

mrceyhun · 2022-06-07T20:44:30Z

Hi @jenimal @amaltaro
Here is the new dashboard[3], which uses the suggested group-by terms. There are 2 rows. Each row creates "CMS_JobRetryCount" pie charts, one for each retry count value. Doing so, we can show more data without pushing against the ES aggregation limit.

All panels count unique "GlobalJobId".
- The 1st row shows values grouped by Workflow and Site.
- The 2nd row shows values grouped by Campaign. I selected this field in place of "dataset information", which does not exist in the ClassAds (I don't know if we can use it or not, but "DESIRED_dataset" is mostly null).
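
The counting of unique "GlobalJobId" values can be sketched as follows, assuming the same job may appear in more than one document. The `Campaign` key used in the sample data is a hypothetical field name (the actual ClassAd name is not given here), and the helper name and sample values are made up.

```python
from collections import defaultdict


def unique_jobs_per_retry(docs, group_field):
    """Count distinct GlobalJobId values per (retry count, group value)."""
    seen = defaultdict(set)
    for doc in docs:
        key = (doc["CMS_JobRetryCount"], doc.get(group_field))
        # A set keeps each job once, even if several documents report it.
        seen[key].add(doc["GlobalJobId"])
    return {key: len(ids) for key, ids in seen.items()}


# Made-up documents; job "a" is reported twice and must be counted once.
sample = [
    {"GlobalJobId": "a", "CMS_JobRetryCount": 1, "Campaign": "RunIISummer20UL18GEN"},
    {"GlobalJobId": "a", "CMS_JobRetryCount": 1, "Campaign": "RunIISummer20UL18GEN"},
    {"GlobalJobId": "b", "CMS_JobRetryCount": 1, "Campaign": "RunIISummer20UL18GEN"},
    {"GlobalJobId": "c", "CMS_JobRetryCount": 2, "Campaign": "RunIISummer20UL18GEN"},
]
per_campaign = unique_jobs_per_retry(sample, "Campaign")
```

In the dashboard this deduplication would presumably be handled by a unique-count aggregation in ES rather than in client code.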

What do you think?

[3] https://monit-grafana.cern.ch/d/6BfIQzC7z/wmcore-job-retry-count?orgId=11

jenimal · 2022-06-08T15:54:30Z

I think we are definitely getting somewhere with this latest plot! Let me run it by Christoph and see what he thinks.
An interesting thing that jumps out at you is that we are having to retry 3X before we get to success. Now to go figure out what our dominant exit code is leading up to that, and why we are being so inefficient.

amaltaro · 2022-06-08T16:02:16Z

@mrceyhun I have two observations on this dashboard:

- Isn't it missing retry count 0? Meaning, the job succeeded in the first attempt (without any retries).
- Looking at the agent configuration, the max number of retries we have defined is 4 (but if you could determine it automatically, it would be better).

mrceyhun · 2022-06-09T10:22:37Z

Hi @amaltaro

- Since it would increase the group-by aggregations (and we have an ES limit), I arranged the plots to show only retry count > 0. I have now created separate plots for retry count = 0.
- Retry count values are already fetched from the ES index automatically on every time-range change.
Dashboard

amaltaro · 2023-09-22T01:14:54Z

@jenimal @leggerf Jen, Federica, we are trying to organize the issues to be planned for Q4/2023, and this has been ranked as the most important ticket under the "Monitoring" requests (other than campaign and task type, which we are likely delivering still in Q3).

I don't really know if this question has to be directed to the Monitoring or to the P&R team, so here goes my attempt.
Could one of you please clarify where we stand with this?
Is this ticket still relevant?
Is this information already provided in another monitoring dashboard?

If there is still work to be done, is it supposed to be on the WMCore side (like a lack of information in the monitoring database)? Or is it supposed to be on the Monitoring side, to build dashboards based on data already available?

If it is on WMCore, then could you please specify clear requirements? What exactly are we lacking in monitoring that you need to have? Thanks!

leggerf · 2023-09-22T08:11:47Z

not sure if there is anything else needed on our side. Ceyhun provided a dashboard. I'm adding @nikodemas so that he is also aware of this thread. I think it is up to @jenimal to decide if information is sufficient or not

amaltaro · 2023-12-04T20:45:17Z

Given the lack of activity in this ticket, I am going to close this out. However, if there is a reason to keep it open, please do so. If it's about more/new/better monitoring information, we might want to stick with a fresh new ticket as well. Thank you Jen, Federica et al.

leggerf · 2023-12-05T08:42:20Z

fine for me

vkuznet added New Feature Monitoring labels May 11, 2022

vkuznet mentioned this issue May 11, 2022

How many times do we retry a single job, on average dmwm/CMSMonitoring#141

Open

amaltaro added the Further Discussion label May 11, 2022

amaltaro changed the title ~~Provide WMStats information to MONIT~~ Provide workflow/job information to MONIT Sep 22, 2023

amaltaro closed this as completed Dec 4, 2023
