WMCore developer responsibilities
NOTE: this is a non-exhaustive list of responsibilities. Given that we work in a large collaboration, we should also help others whenever it falls within our area of expertise.
Besides software design, development, testing, integration and troubleshooting, a WMCore developer needs to carry out other activities such as:
- support central production activities and make sure the CMS Computing needs are met;
- deploy central services and WMAgent and ensure their stability and availability;
- integrate the WM system with any other technologies being adopted by CMS (including the DM system);
- etc.
Expanding a bit on some of those activities:
- on Mattermost:
  - be part of the CMS O&C team and join at least the "Computing Operations", "Core Software", "DMWM" and "Submission Infrastructure" public channels. If we are needed there, we usually get tagged, though;
  - join the "WM toolset Project" and "WMStats remake Project" private channels, where we hold a weekly chat with some students;
- on Slack:
  - be part of the "CMS DMWM" group and join most of those channels;
  - be part of the "P&R" group and join at least the "general" and "wmcore-support" channels;
  - be part of the "cms-pdmv-pnr-ops" group and join at least the "rapid-communication" channel ("data-issue" might be required too);
- on GitHub project boards:
  - an overview of the nodes available with WMAgent and their statuses can be found in the Agents status GitHub board. It lists all the agents used for central production, including RelVal and HEPCloud agents (and also CMS@Home, which is not yet part of the production effort);
  - another important board to keep up to date tracks the latest WMAgent stable releases, including the patch releases made available and which patches need to be applied to the production agents. More information is in the WMAgent recent releases GitHub board;
- deploy and maintain the CMS@Home WMAgent on vocms0267 (with special resource-control and credentials; connected to testbed);
- deploy and maintain production agents at Fermilab;
- deploy and maintain production agents at CERN;
- deploy and maintain a single RelVal agent at CERN;
- create WMCore tags and update the cmsdist spec files under the "comp_gcc630" branch. Request deployment in CMSWEB in a timely manner; properly validate services in pre-production; announce any breaking changes and/or new features needed by other groups; provide validation results ~3 days before the production upgrade;
- meetings:
  - attend the T0 meeting on Mondays at 3:30pm CEST;
  - attend/call the WMCore meeting on Mondays at 4pm CEST, preparing the Google document and adding it to the Indico entry;
  - attend the Computing Operations meeting at 5pm CEST, filling in the Workload management report;
  - attend the Rucio transition meeting on Tuesdays at 4pm CEST;
  - attend the O&C meeting on Wednesdays at 3pm CEST;
  - attend the P&R meeting on Wednesdays at 4pm CEST;
  - attend the P&R/WMCore dev meeting on Fridays at 2pm CEST;
  - attend the WMCore/WMControl/CMSSW meeting on Fridays at 3pm CEST.
- monitoring:
  - in short, we should make sure that there are enough agents (usually >=4) up and running, both CERN and FNAL based. The number of pending jobs is not supposed to drop below 200k. For the running state, we should always look at the running cores (which is much more meaningful than running jobs), and the expected load is not supposed to drop below 150k production cores. For short-term monitoring (last 2 days or so), one can use the CMS Job Monitoring Grafana dashboard;
  - we should keep an eye on the overall WM system monitoring as well, which provides an overview of the ReqMgr2 requests, Global WorkQueue, expected jobs in the grid, status of the agents (same as WMStats), and so on. For that, we can use the CMS WMAgent Monitoring Grafana dashboard.
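The thresholds above can be written down as a small check. What follows is a minimal Python sketch and not part of WMCore: the `Snapshot` class, its field names and `check_snapshot` are hypothetical, and the numbers are assumed to be read off the Grafana dashboards by some means not shown here. Reading "both CERN and FNAL based" as "at least one agent at each site" is also an assumption.

```python
from dataclasses import dataclass

# Thresholds quoted in the monitoring notes above.
MIN_AGENTS = 4
MIN_PENDING_JOBS = 200_000
MIN_RUNNING_CORES = 150_000


@dataclass
class Snapshot:
    """Hypothetical container for values read from the dashboards."""
    agents_cern: int      # agents up and running at CERN
    agents_fnal: int      # agents up and running at FNAL
    pending_jobs: int     # jobs pending in the system
    running_cores: int    # production cores currently running


def check_snapshot(snap):
    """Return a list of human-readable warnings (empty if all looks fine)."""
    warnings = []
    total_agents = snap.agents_cern + snap.agents_fnal
    if total_agents < MIN_AGENTS:
        warnings.append(f"only {total_agents} agents running (expected >= {MIN_AGENTS})")
    # Assumption: "both CERN and FNAL based" means at least one agent per site.
    if snap.agents_cern == 0 or snap.agents_fnal == 0:
        warnings.append("agents are not running at both CERN and FNAL")
    if snap.pending_jobs < MIN_PENDING_JOBS:
        warnings.append(f"pending jobs at {snap.pending_jobs} (expected >= {MIN_PENDING_JOBS})")
    if snap.running_cores < MIN_RUNNING_CORES:
        warnings.append(f"running cores at {snap.running_cores} (expected >= {MIN_RUNNING_CORES})")
    return warnings


if __name__ == "__main__":
    for message in check_snapshot(Snapshot(2, 1, 180_000, 160_000)):
        print("WARNING:", message)
```

The check deliberately looks at running cores rather than running jobs, following the note that cores are the more meaningful metric.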
Another very important task concerns the stability of the production WMAgents. In case of issues with central production (commonly originating from production workflow failures and unexpected behaviors), we need to do some basic debugging of the problem. If we suspect there is a problem in WMCore, a GitHub issue should be created right away and its priority evaluated according to the overall impact. Blocker issues get the "highest priority", which means we stop whatever tasks we have been working on and focus only on that issue.
Once a fix has been made against the master branch (with satisfactory Jenkins CI results) and the patch has been tested in one of our pre-production agents, we need to backport that fix to the WMAgent branch in WMCore (e.g. 1.3.6_wmagent) and patch all the production agents. Keep in mind that (some) components need to be restarted to generate new Python byte-code.
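The backport step can be sketched as a short sequence of git commands. The Python helper below only builds that sequence and does not run anything; it is an illustration rather than the team's actual procedure. The remote name `upstream`, the `backport-...` local branch name and the helper itself are assumptions, and the commit hash in the usage example is a placeholder.

```python
def backport_commands(fix_sha, branch="1.3.6_wmagent", remote="upstream"):
    """Build the git commands to cherry-pick a master fix onto a WMAgent branch.

    fix_sha: hash of the commit already merged into master.
    branch:  WMAgent release branch to patch (example name from the text).
    remote:  name of the WMCore remote (assumed, adjust to the local setup).
    """
    work_branch = f"backport-{fix_sha[:8]}"  # hypothetical local branch name
    return [
        ["git", "fetch", remote],
        # Start a local branch from the tip of the release branch.
        ["git", "checkout", "-b", work_branch, f"{remote}/{branch}"],
        # -x records the original commit hash in the new commit message.
        ["git", "cherry-pick", "-x", fix_sha],
    ]


if __name__ == "__main__":
    for cmd in backport_commands("0123456789abcdef"):
        print(" ".join(cmd))
```

Each command list could be handed to `subprocess.run(cmd, check=True)` inside a WMCore checkout. Patching the production agents and restarting the affected components remain separate manual steps.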