FAQ - frequently asked questions
This wiki is meant to collect the most common questions and answers related to Workload Management operations.
Just a reminder about the usual monitoring tools though:
- WMAgent monitoring: https://monit-grafana.cern.ch/d/lhVKAhNik/cms-wmagent-monitoring?from=now-2d&orgId=11&refresh=5m&to=now
- WMCore Workflow monitoring: https://cmsweb.cern.ch/wmstats/index.html
- CMS Job monitoring: MonIT_JobMonitoring
- Job/Condor pool monitoring: https://cms-gwmsmon.cern.ch/
- Production Condor pool summary: http://cms-htcondor-monitor.t2.ucsd.edu/letts/production.html
While there is no clear answer to such a question, there is likely enough monitoring information to reach a conclusion.
From the monitoring links above, one can open the Production Condor pool summary link, go to the Site Table and check the last row of the IdleCpus column. Right now the value is 3723, so there are 3723 CPUs free in the system, and the likely reason they are not used is that (some) workflows are not properly dimensioned, sometimes requesting more memory than the usual 2.5GB/core.
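If you want to cross-check those dashboard numbers directly against the pool, below is a rough sketch using the HTCondor Python bindings. The collector hostname is a placeholder (use the production pool collector), and it assumes partitionable slots advertise their leftover Cpus/Memory:

```python
import htcondor

# Placeholder collector host: point it to the production pool collector.
coll = htcondor.Collector("collector.example.cern.ch")

# Assumption: partitionable slots report their *leftover* resources.
slots = coll.query(
    htcondor.AdTypes.Startd,
    constraint="PartitionableSlot =?= True",
    projection=["Name", "Cpus", "Memory"],
)

idle_cpus = sum(ad.get("Cpus", 0) for ad in slots)
# Slots with free CPUs but less than ~2.5 GB (2500 MB) of free memory per core:
starved = [ad for ad in slots
           if ad.get("Cpus", 0) > 0 and ad.get("Memory", 0) < 2500 * ad.get("Cpus", 0)]
print("Leftover CPUs: %d (%d slots look memory-starved)" % (idle_cpus, len(starved)))
```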
The WMAgent monitoring also has some interesting plots in this respect, especially the "GQ elements by priority" ones, for instance this one, which shows thousands of GQEs in Available status above 80k priority. That would also explain why workflows at 80k priority are not going through.
Final note: if you still think there might be a problem in the system, you can always pick one workflow and bump its priority to the highest in the system. If it does not get jobs running within a couple of hours, then there is a high chance of a problem in the WM system (provided the site is up & running).
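For the record, bumping the priority can also be done programmatically. This is a hedged sketch assuming the ReqMgr2 REST endpoint accepts a JSON PUT with a RequestPriority payload and that you authenticate with your grid certificate; the workflow name and certificate paths are placeholders:

```python
import json
import ssl
import urllib.request

workflow = "WORKFLOW_NAME"  # placeholder
url = "https://cmsweb.cern.ch/reqmgr2/data/request/%s" % workflow

# Grid certificate/key paths are placeholders.
ctx = ssl.create_default_context()
ctx.load_cert_chain("/path/to/usercert.pem", "/path/to/userkey.pem")

# Assumption: RequestPriority can be updated with a JSON PUT like this.
payload = json.dumps({"RequestPriority": 999999}).encode()
req = urllib.request.Request(
    url, data=payload, method="PUT",
    headers={"Content-Type": "application/json", "Accept": "application/json"},
)
print(urllib.request.urlopen(req, context=ctx).read().decode())
```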
Why is the workflow configured to request X GB of memory while jobs in Condor request something different?
One way to answer this is to look at the job classads and check whether the job has been tuned or not. Another possibility is that you're not looking in the right place, because the memory requirements can be overridden during workflow assignment (or any time before the workflow gets assigned).
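A quick way to inspect the job classads is through the HTCondor Python bindings (the same can be done with condor_q). The WMAgent_RequestName classad used in the constraint is an assumption about how agent jobs are tagged; adjust it, and the schedd you query, to your case:

```python
import htcondor

# Local schedd by default; use Collector().locate(...) to reach a remote agent schedd.
schedd = htcondor.Schedd()

workflow = "WORKFLOW_NAME"  # placeholder
jobs = schedd.query(
    constraint='WMAgent_RequestName == "%s"' % workflow,  # classad name is an assumption
    projection=["ClusterId", "ProcId", "RequestMemory", "MemoryUsage"],
)
for ad in jobs:
    print(ad.get("ClusterId"), ad.get("ProcId"),
          "RequestMemory=%s" % ad.get("RequestMemory"),
          "MemoryUsage=%s" % ad.get("MemoryUsage"))
```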
The most reliable way to check the workflow/task memory requirements is through the following link (replace the workflow name with the one you want to look at):
https://cmsweb.cern.ch/reqmgr2/config?name=cmsunified_ACDC0_task_SMP-RunIISummer15GS-00286__v1_T_200408_083701_9533
then look for the keyword memoryRequirement. It will give you the memory requirements for every single task in the workflow (in this case it's 4GB for all of them).
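The same check can be scripted. A minimal sketch, assuming CMSWEB authentication with a grid certificate (paths below are placeholders):

```python
import ssl
import urllib.request

workflow = "cmsunified_ACDC0_task_SMP-RunIISummer15GS-00286__v1_T_200408_083701_9533"
url = "https://cmsweb.cern.ch/reqmgr2/config?name=%s" % workflow

ctx = ssl.create_default_context()
ctx.load_cert_chain("/path/to/usercert.pem", "/path/to/userkey.pem")  # placeholders

# Print only the memoryRequirement lines of the rendered configuration.
text = urllib.request.urlopen(url, context=ctx).read().decode()
for line in text.splitlines():
    if "memoryRequirement" in line:
        print(line.strip())
```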
In most cases, the JSON tab/view can also be used, e.g.: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_SMP-RunIISummer15GS-00286__v1_T_200408_083701_9533 but it can be tricky, in the sense that the parameter appears multiple times and one takes precedence over another.
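To make that precedence explicit, here is a sketch that reads the request document through the ReqMgr2 REST API (assumed to be equivalent to the JSON view) and compares the top-level Memory with the per-task ones; the layout of the response wrapper and the Task1/Task2 keys are assumptions based on TaskChain-like requests:

```python
import json
import ssl
import urllib.request

workflow = "cmsunified_ACDC0_task_SMP-RunIISummer15GS-00286__v1_T_200408_083701_9533"
url = "https://cmsweb.cern.ch/reqmgr2/data/request/%s" % workflow

ctx = ssl.create_default_context()
ctx.load_cert_chain("/path/to/usercert.pem", "/path/to/userkey.pem")  # placeholders

req = urllib.request.Request(url, headers={"Accept": "application/json"})
data = json.loads(urllib.request.urlopen(req, context=ctx).read())
request = list(data["result"][0].values())[0]  # response wrapper layout is an assumption

print("top-level Memory:", request.get("Memory"))
for key, value in request.items():
    # Per-task dictionaries (e.g. Task1, Task2, ...) may override the top-level value.
    if key.startswith("Task") and isinstance(value, dict) and "Memory" in value:
        print(key, "->", value["Memory"])
```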