Provide workflow/job information to MONIT #11140

Closed
vkuznet opened this issue May 11, 2022 · 22 comments

Comments

@vkuznet
Contributor

vkuznet commented May 11, 2022

Impact of the new feature
The data-ops team would like to have a better understanding of job management.

Is your feature request related to a problem? Please describe.
As described in dmwm/CMSMonitoring#141 Jen wants to see the following:

  1. A simple number of retries for each job. I suspect this number is 0/1, but Christoph wants to know it, since the only jobs we ever look at in the end are the failed ones.
  2. Exit code / site / number of retries, as this could tell us if there is a site issue that we are missing.
  3. Exit code / retry / workflow and/or campaign or dataset, because sometimes, if it is a file read issue and we can spot a spike of retries that isn't settling down as a workflow progresses, we may need to make more replicas.

Describe the solution you'd like
There are two solutions to this problem:

  1. Feed WMStats information to CMS MONIT infrastructure, a.k.a. push approach
    • in this approach we need to decide which information to push to the CMS MONIT infrastructure and at which interval
    • it will require a decision on how the information should be pushed, e.g. directly to the CERN AMQ brokers (code would have to be developed for that) or via the CMSAMQProxy server (in which case HTTP requests are sufficient)
    • we'll need to decide on the data format and schema
  2. Add APIs to WMStats (or any other service) which can provide this information to upstream caller, i.e. pull approach
    • in this scenario, we need to convert WMStats from a JavaScript application within CouchDB into a real HTTP service with (RESTful) APIs
    • we'll need to decide on the data format and API metrics
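
To make the push approach a bit more concrete, here is a minimal sketch of what a per-job document and its HTTP request could look like. Everything in it is an assumption for illustration: the `JobRetryRecord` fields, the `build_push_request` helper and the placeholder URL are hypothetical, and none of it reflects an agreed schema or the actual CMSAMQProxy interface.

```python
import json
import time
from dataclasses import asdict, dataclass
from urllib.request import Request


@dataclass
class JobRetryRecord:
    """Hypothetical per-job document; field names are illustrative only."""
    workflow: str
    job_id: str
    site: str
    exit_code: int
    retry_count: int
    timestamp: int


def build_push_request(record: JobRetryRecord, url: str) -> Request:
    """Wrap one record in an HTTP POST request without sending it."""
    body = json.dumps(asdict(record)).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


# Placeholder endpoint, not a real service URL.
PROXY_URL = "https://example.invalid/amq-proxy"

record = JobRetryRecord(
    workflow="example_workflow",
    job_id="job-0001",
    site="T2_US_Wisconsin",
    exit_code=0,
    retry_count=1,
    timestamp=int(time.time()),
)
request = build_push_request(record, PROXY_URL)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Actually sending it would be urllib.request.urlopen(request); omitted on purpose.
```

The request is only constructed, never sent, so the sketch can be used to discuss the document layout before any transport decision is made.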

Describe alternatives you've considered
At the moment, we have the CMS_JobRetryCount information in the ES condor data, see here, but we are not sure whether it is sufficient for the requested use case.

Additional context
We may review the WMStats information and its applicability for CMS Monitoring, and have a broader discussion with the CMS Monitoring and data-ops groups.

@mrceyhun , @leggerf , @brij01 , @jenimal

@vkuznet
Contributor Author

vkuznet commented May 11, 2022

@amaltaro , @todor-ivanov , @klannon , @dciangot this issue requires coordination with the data-ops group to figure out which information is required for their monitoring needs. Once this is clear, we can discuss how to fetch it from WMStats.

@amaltaro
Contributor

Without really thinking about this, we might be able to accommodate most of this with the data already uploaded to WMArchive.

@jenimal

jenimal commented May 19, 2022

@amaltaro just wondering if there is an update on how we can get the data for tracking.

@jenimal

jenimal commented Jun 6, 2022

I realize this isn't high on your priority list, but Christoph asks about it weekly at the L2/3 meeting, so if we can get some sort of feedback that would be great.

@amaltaro
Contributor

amaltaro commented Jun 6, 2022

Before we start working on yet another monitoring system, I think we should evaluate what is already available from the current monitoring systems:

  1. WMArchive
  2. Job Monitoring in Kibana

Once we know what is missing and which kind of information we need, then we can discuss how to make that available.

In short, I am pretty sure that CMS Job Monitoring has the job retry information. Valentin, can you please check whether job retry information is available in the information uploaded from the agents?

@vkuznet
Contributor Author

vkuznet commented Jun 6, 2022

@amaltaro , we already provided information you asked for, see dmwm/CMSMonitoring#141

  • WMArchive schema is here
  • JobMonitoring data comes from condor. In Kibana it corresponds to the monit_prod_condor_raw_metric index, and you can view a single document here

Also, Jen asked for specific information which I doubt is present in either WMArchive or the Condor data, since it is related to WMAgent activity on a certain site. In other words, it is WMAgent (intermediate) information which we do not store in either of the above sources.

@amaltaro
Contributor

amaltaro commented Jun 7, 2022

Thanks for these pointers, Valentin.

From the WMArchive schema that you provided, indeed it seems we do not have the retry number information in the document uploaded to WMArchive. I think we should modify this schema and make sure the job retry number is there.

Regarding CMS Job Monitoring (index monit_prod_condor_raw_metric*), one can build a visualization using (data object):

"CMS_WMTool": "WMAgent",
"CMS_JobType": "Production",  # to classify job groups
"CMS_JobRetryCount": 0,
"ExitStatus": 0,  # this gives the exit code for the job as a whole
"Chirp_WMCore_cmsRun_ExitCode": 0,  # this gives the exit code for the cmsRun status
"GLIDEIN_CMSSite": "T2_US_Wisconsin",  # or one of the variations that report the site name where the job ran

I'd suggest having someone (Jen, or the monitoring team) try to build a visualization in Kibana/Grafana.
Regardless of whether that is enough or not, I also think we should investigate the feasibility of providing the job retry number to WMArchive as well, though this is likely to be considered for the next quarter only.
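
As a rough illustration of how the fields listed above could be combined, the sketch below groups a handful of documents by site and exit code and sums their retry counts. The sample documents (including the `T1_EXAMPLE_Site` name) are made up, and the helper `retries_by_site_and_exit_code` is hypothetical; only the ClassAd names come from the data object quoted above.

```python
from collections import defaultdict

# Made-up sample documents; only the field names come from the condor data.
docs = [
    {"GLIDEIN_CMSSite": "T2_US_Wisconsin", "ExitStatus": 0, "CMS_JobRetryCount": 0},
    {"GLIDEIN_CMSSite": "T2_US_Wisconsin", "ExitStatus": 1, "CMS_JobRetryCount": 2},
    {"GLIDEIN_CMSSite": "T2_US_Wisconsin", "ExitStatus": 1, "CMS_JobRetryCount": 1},
    {"GLIDEIN_CMSSite": "T1_EXAMPLE_Site", "ExitStatus": 1, "CMS_JobRetryCount": 3},
]


def retries_by_site_and_exit_code(documents):
    """Return {(site, exit_code): {"jobs": n, "retries": total}}."""
    summary = defaultdict(lambda: {"jobs": 0, "retries": 0})
    for doc in documents:
        key = (doc["GLIDEIN_CMSSite"], doc["ExitStatus"])
        summary[key]["jobs"] += 1
        summary[key]["retries"] += doc["CMS_JobRetryCount"]
    return dict(summary)


summary = retries_by_site_and_exit_code(docs)
for (site, exit_code), stats in sorted(summary.items()):
    print(site, exit_code, stats)
```

A site whose failing exit code accumulates many retries relative to its job count would be a candidate for a closer look.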

@mrceyhun
Contributor

mrceyhun commented Jun 7, 2022

Hi @jenimal @amaltaro
I created [1], which sums the CMS_JobRetryCount ClassAd values grouped by the selected Group By term. I have already added variables and filters; you can give it a try and ask for further suggestions. Hope it helps.

[1] https://monit-grafana.cern.ch/d/6BfIQzC7z/wmcore-job-retry-count?orgId=11

@jenimal

jenimal commented Jun 7, 2022

I'm trying to interpret exactly what I am seeing in the plot.
Why are we having 2255 retries on a particular job for the campaign RunIISummer20UL18GEN? I'm guessing that is retries overall for a workflow, not on a particular job, so I don't think we are getting what we are actually looking for in this plot.

@leggerf

leggerf commented Jun 7, 2022 via email

@vkuznet
Contributor Author

vkuznet commented Jun 7, 2022

I think the main question here is whether ops need information per workflow or per job. If the former, the condor data can be used; if the latter, then I think we need WMStats info.

@mrceyhun
Contributor

mrceyhun commented Jun 7, 2022

Hi @jenimal , I created the plot according to my assumptions. It just sums the retries for each group-by value within the defined binning, which you can select in the dashboard. To improve the plot, could you please define what kind of graph we're trying to see using the current source[2]?

Maybe we can use the "WMAgent_JobID" field for individual job retries, but this needs confirmation. In short, I need more input from you to create the plot. You can see the fields and their values in Kibana[2].

[2] https://monit-kibana.cern.ch/kibana/goto/4cbc8ab9235b687163eb6287bb129c4d

@amaltaro
Contributor

amaltaro commented Jun 7, 2022

I leave Jen to provide the requirements for this visualization.

@mrceyhun However, from my side, I would say that having a histogram (or pie-chart) with the distribution of jobs with exit code == 0 grouped by their retry number would be extremely helpful. I can't think of a good way to make it a time series, but you might.

An improvement for such information would be to be able to classify it in one of these:

  • job type (CMS_JobType classad, AFAIR)
  • workflow name, or task name, or prep_id (unsure which one is more meaningful)
  • site name
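
A minimal sketch of the suggested histogram, assuming the documents have already been fetched as plain dictionaries: it counts jobs (rather than summing retries) with exit code 0 per retry number, split by one classification field. The sample documents and the `success_histogram` helper are made up for illustration; the field names are the ClassAds mentioned earlier in this thread.

```python
from collections import Counter, defaultdict

# Made-up sample documents; the job type values are illustrative only.
docs = [
    {"CMS_JobType": "Production", "ExitStatus": 0, "CMS_JobRetryCount": 0},
    {"CMS_JobType": "Production", "ExitStatus": 0, "CMS_JobRetryCount": 0},
    {"CMS_JobType": "Production", "ExitStatus": 0, "CMS_JobRetryCount": 2},
    {"CMS_JobType": "Processing", "ExitStatus": 0, "CMS_JobRetryCount": 1},
    {"CMS_JobType": "Processing", "ExitStatus": 1, "CMS_JobRetryCount": 3},
]


def success_histogram(documents, group_field):
    """Count successful jobs per retry number, split by group_field."""
    hist = defaultdict(Counter)
    for doc in documents:
        if doc["ExitStatus"] == 0:  # keep only jobs that ended successfully
            hist[doc[group_field]][doc["CMS_JobRetryCount"]] += 1
    return {group: dict(counts) for group, counts in hist.items()}


histogram = success_histogram(docs, "CMS_JobType")
for job_type, counts in sorted(histogram.items()):
    print(job_type, counts)
```

Passing a different `group_field`, such as the site ClassAd, would give the same distribution per site, provided that field is present in every document.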

@jenimal

jenimal commented Jun 7, 2022

Alan has good suggestions above.
I think a pie chart would be a good way to easily see how many jobs succeed after each try.
Another way to classify info that would be valuable would be input dataset name. This could help us to identify which datasets are being heavily read, and jobs failing because the files are "busy" and having to retry.

@mrceyhun
Contributor

mrceyhun commented Jun 7, 2022

Hi @jenimal @amaltaro
Here is the new dashboard[3], which uses the suggested group-by terms. There are 2 rows. Each row creates one "CMS_JobRetryCount" pie chart per retry count value. This way, we can show more data without hitting the ES aggregation limit.

All panels count unique "GlobalJobId".
The 1st row shows values grouped by Workflow and Site.
The 2nd row shows values grouped by Campaign. I selected this field in place of "dataset information", which does not exist in ClassAds (I don't know if we can use it or not, but "DESIRED_dataset" is mostly null).
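
For readers who want to reproduce the idea outside Grafana, the sketch below builds one small Elasticsearch search body per retry count value, each counting unique "GlobalJobId" values per group. This is only an assumption about how such panels could be queried, not the dashboard's actual definition: the `build_retry_query` helper, the choice of `GLIDEIN_CMSSite` as the group field, the group limit of 20 and the retry values 1 to 4 are all illustrative, and the fields are assumed to be mapped so that term filters and terms aggregations work on them.

```python
import json


def build_retry_query(retry_count: int, group_field: str, max_groups: int = 20) -> dict:
    """Elasticsearch search body: unique jobs per group for one retry count."""
    return {
        "size": 0,  # return aggregation results only, no raw hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"CMS_WMTool": "WMAgent"}},
                    {"term": {"CMS_JobRetryCount": retry_count}},
                ]
            }
        },
        "aggs": {
            "by_group": {
                "terms": {"field": group_field, "size": max_groups},
                "aggs": {
                    # approximate distinct count of jobs within each group
                    "unique_jobs": {"cardinality": {"field": "GlobalJobId"}}
                },
            }
        },
    }


# One small query per retry value instead of one large nested aggregation.
queries = {n: build_retry_query(n, "GLIDEIN_CMSSite") for n in range(1, 5)}
print(json.dumps(queries[1], indent=2))
```

Splitting by retry value keeps each aggregation to a single grouping level, which is the same trade-off described for the dashboard rows.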

What do you think?

[3] https://monit-grafana.cern.ch/d/6BfIQzC7z/wmcore-job-retry-count?orgId=11

@jenimal

jenimal commented Jun 8, 2022

I think we are definitely getting somewhere with this latest plot! Let me run it by Christoph and see what he thinks.
An interesting thing that jumps out at you is that we are having to retry 3X before we get to success. Now to go figure out what our dominant exit code is leading up to that, and figure out why we are being so inefficient.

@amaltaro
Contributor

amaltaro commented Jun 8, 2022

@mrceyhun I have two observations on this dashboard:

  • isn't it missing retry count 0? Meaning, the job succeeded in the first attempt (without any retries)
  • looking at the agent configuration, the max number of retries we have defined is 4 (but if you could make it automatic, it would be better)

@mrceyhun
Contributor

mrceyhun commented Jun 9, 2022

Hi @amaltaro

  • Since it would increase the group-by aggregations (and we have an ES limit), I arranged the plots to show only retry count > 0. I have now created separate plots for the retry count = 0 values.
  • Retry count values are already fetched from the ES index automatically on every time-range change.
    Dashboard

@amaltaro
Contributor

@jenimal @leggerf Jen, Federica, we are trying to organize the issues to be planned for Q4/2023, and this has been ranked as the most important ticket under the "Monitoring" requests (other than campaign and task type, which we are likely still delivering in Q3).

I don't really know if this question has to be directed to the Monitoring or to the P&R team, so here goes my attempt.
Could one of you please clarify where we stand with this?
Is this ticket still relevant?
Is this information already provided in another monitoring dashboard?

If there is still work to be done, is it supposed to be on the WMCore side (like a lack of information in the monitoring database)? Or is it supposed to be on the Monitoring side, to build dashboards based on data already available?

If it is on WMCore, then could you please specify clear requirements? What exactly are we lacking in monitoring that you need to have? Thanks!

@leggerf

leggerf commented Sep 22, 2023

not sure if there is anything else needed on our side. Ceyhun provided a dashboard. I'm adding @nikodemas so that he is also aware of this thread. I think it is up to @jenimal to decide if information is sufficient or not

@amaltaro amaltaro changed the title from "Provide WMStats information to MONIT" to "Provide workflow/job information to MONIT" Sep 22, 2023
@amaltaro
Contributor

amaltaro commented Dec 4, 2023

Given the lack of activity in this ticket, I am going to close this out. However, if there is a reason to keep it open, please do so. If it's about more/new/better monitoring information, we might want to stick with a fresh new ticket as well. Thank you Jen, Federica et al.

@amaltaro amaltaro closed this as completed Dec 4, 2023
@leggerf

leggerf commented Dec 5, 2023

fine for me
