WMStats details
This document summarizes different topics about the WMStats server, following issue #11411. Below is a list of action items required for the design discussion:
- [DONE] current data stored in the database (including required vs likely non-required data)
- [DONE] how data gets published to the wmstats database
- [DONE] functionalities provided by WMStats (e.g., ACDC creation)
- [DONE] APIs used to load data into WMStats
- [DONE] data structure of the data loaded into WMStats
- [DONE] Weak and/or missing functionalities
- [DONE] proposal of new WMStats UI server implementation
The WMStats server is deployed on the cmsweb cluster and provides data via the following URLs:
# WMStats cache server
scurl -s https://cmsweb.cern.ch/wmstatsserver
# WMStats info API
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/info
# cache (heaviest) API which returns everything
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/requestcache
# job detail API example
https://cmsweb-testbed.cern.ch/wmstatsserver/data/jobdetail/tivanov_TC_6Tasks_Scratch_HG2301_Val_230109_081126_7143
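The same endpoints can also be queried programmatically. Below is a minimal Python sketch, assuming the `requests` library and a grid certificate/key pair usable for cmsweb authentication (the certificate paths are placeholders):

```python
import requests

# Placeholder paths: point these at your own grid certificate/key used for cmsweb auth
CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")
BASE = "https://cmsweb.cern.ch/wmstatsserver"

def get_json(path):
    """Fetch a WMStats server API endpoint and return the decoded JSON."""
    resp = requests.get(BASE + path, cert=CERT,
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # e.g. the lightweight info API
    print(get_json("/data/info"))
```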
The WMStats server APIs are defined in RestApiHub.py.
It stores and fetches data from the underlying CouchDB. CouchDB stores unstructured or semi-structured data in JSON format and does not impose any schema on the stored data. Therefore, the structure of the stored data is driven by external services, e.g. the WMAgent info can be obtained via this URL call:
https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl7/_view/agentInfo
while workflow information can be obtained as follows:
scurl -s -X POST -H "Content-type: application/json" -d '{"keys": ["pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816"]}' "https://cmsweb.cern.ch:8443/couchdb/reqmgr_workload_cache/_all_docs?include_docs=True" | jq
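The same CouchDB query can be issued from Python. A minimal sketch, assuming the `requests` library and the same placeholder certificate paths as in the earlier example:

```python
import requests

CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")  # placeholder paths
URL = "https://cmsweb.cern.ch:8443/couchdb/reqmgr_workload_cache/_all_docs?include_docs=true"

# POST the list of workflow names (CouchDB document ids) we want to fetch
payload = {"keys": ["pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816"]}
resp = requests.post(URL, cert=CERT, json=payload,
                     headers={"Accept": "application/json"})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    doc = row.get("doc", {}) or {}
    print(row.get("id"), doc.get("RequestStatus"))
```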
The CouchDB reqmgr_workload_cache database stores the high-level description of a workflow, while other pieces of information come from the wmstats CouchDB database, mostly related to the evolution of workflow tasks and jobs; see DataCacheUpdate.py.
In both cases above the structure of the stored documents is defined somewhere in the WMCore code base. That said, the structure among documents within a single collection remains the same, e.g. all workflow documents will have common keys/values, but not all keys are mandatory. For example, the Task sub-structure of a workflow document only appears if there are tasks in the given workflow, and it may be skipped if they do not exist.
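Because such keys are optional, client code should access them defensively. A short illustrative sketch, assuming TaskChain-style workflow documents where the task count is stored under TaskChain and the tasks under Task1, Task2, ...:

```python
def list_tasks(workflow_doc):
    """Return the task sub-documents of a workflow document, if any.

    Assumption: TaskChain-style documents carry their tasks as optional
    Task1, Task2, ... keys; documents without tasks simply omit them,
    hence the .get() style access everywhere.
    """
    tasks = []
    num_tasks = workflow_doc.get("TaskChain", 0)  # absent for workflows without tasks
    for idx in range(1, num_tasks + 1):
        task = workflow_doc.get("Task%d" % idx)
        if task is not None:
            tasks.append(task)
    return tasks
```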
The WMStats database resides in CouchDB and is called wmstats. This wmstats database is populated by WMAgent, either by POSTing the workflow summary or by replicating job/fwjr information from the agent local CouchDB to the central CouchDB. The relevant CouchDB databases on the agent side, from which information is posted or replicated to the central CouchDB, are named wmagent_jobdump/fwjrs and wmagent_jobdump/jobs; the former contains the outcome report extracted from the CMSSW FJR document, while the latter contains job-related information, such as input data, lumis, etc.
As for the workflow description, it is stored in the reqmgr_workload_cache CouchDB database, which is populated by end users (e.g. Unified, McM, users creating workflows) and also updated by some systems, like the MicroServices and Global WorkQueue. The workflow JSON document comes from an initial standard spec, see the StdSpecs area. Each workflow request is created via the ReqMgr2 service, which provides create/update methods, see Request.py. For more information about the data flow in CouchDB, please refer to this graph.
The WMStats web server consumes data from both wmstats and reqmgr_workload_cache, and stores it all in an in-memory data cache.
The WMStats UI server provides the following functionalities:
- list all known workflows in the system along with aggregated information such as the number of processed events, lumis, failure rate, etc.
- current status of all WMAgents and central services
- aggregated information about Campaigns, Sites, CMSSW releases and WMAgents
- information about failures in various requests
- information about logs of individual requests
- various search and filter capabilities over requests, campaigns, sites, releases, etc.
The WMStats server in-memory data cache code comes from the DataCacheUpdate.py module, which calls WMStatsReader.py. The individual pieces of data come from different APIs, e.g. RequestDBReader.getRequestByStatus, where RequestDBReader.py calls the CouchDB bystatus view.
For example (here and below we use cmsweb-test9 as an example; these URLs can be applied directly to cmsweb or cmsweb-testbed, which will simply provide more data):
# get list of workflows:
scurl -s https://cmsweb-test9.cern.ch/couchdb/reqmgr_workload_cache/_design/ReqMgr/_view/bystatus?include_docs=false
{"total_rows":47,"offset":0,"rows":[
{"id":"amaltaro_DQMHarvest_RunWhitelist_Agent214_Val_221216_110404_1633","key":"assigned","value":1671188646},
{"id":"amaltaro_ReReco_RunBlockWhite_Agent214_Val_221216_110415_6984","key":"assigned","value":1671188657},
...
To get the agent info we call the following view:
scurl -s https://cmsweb-test9.cern.ch/couchdb/wmstats/_design/WMStatsErl7/_view/agentInfo | jq | head -20
{
"total_rows": 9,
"offset": 0,
"rows": [
{
"id": "global_workqueue",
"key": "global_workqueue",
"value": {
"_id": "global_workqueue",
"_rev": "681-6a1bece20a50c1c56b6bea95fb29e961",
"agent_url": "global_workqueue",
"agent_team": "",
"agent_version": "2.1.6rc3",
"timestamp": 1671462286,
"down_components": [],
"type": "agent_info",
"down_component_detail": [],
...
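The agentInfo view above can be consumed, for instance, to spot agents with problematic components. A minimal Python sketch, reusing the placeholder certificate paths from the earlier examples and only the fields visible in the snippet above:

```python
import requests

CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")  # placeholder paths
URL = "https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl7/_view/agentInfo"

resp = requests.get(URL, cert=CERT, headers={"Accept": "application/json"})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    agent = row["value"]
    down = agent.get("down_components", [])
    if down:
        # report agents that declare components as down
        print("%s (v%s) has down components: %s"
              % (agent.get("agent_url"), agent.get("agent_version"), down))
```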
WMStatsWriter.py is responsible for writing data to CouchDB; as mentioned above, though, it is not the only way to write documents to it (replication is the main mechanism). This class is used by different WMCore components.
CouchDB contains many design documents, where each design document contains a small set of views. This was designed such that views can be updated and indexed in their own unix process (one unix process for each design document).
In order to understand which views are used, we may find them in the CouchDB log:
grep _design /data/srv/logs/couchdb/couch.log | grep couchdb.couchdb | grep -v logdb | awk '{print $9,$10}' | head
or we can list all views in a particular DB: https://cmsweb.cern.ch:8443/couchdb/wmstats/_design_docs
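The _design_docs endpoint can also be walked programmatically to enumerate every view per design document. A minimal sketch, again with placeholder certificate paths:

```python
import requests

CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")  # placeholder paths
URL = "https://cmsweb.cern.ch:8443/couchdb/wmstats/_design_docs?include_docs=true"

resp = requests.get(URL, cert=CERT, headers={"Accept": "application/json"})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    design_doc = row.get("doc", {}) or {}
    views = sorted(design_doc.get("views", {}))
    # e.g. "_design/WMStatsErl: requestAgentUrl"
    print("%s: %s" % (row["id"], ", ".join(views)))
```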
To get a DB view, the URI to query is /database/_design/designdocname/_view/viewname
For example:
scurl https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl | jq
{
"_id": "_design/WMStatsErl",
"_rev": "10-01ea14e3d6042a7e0a480e0b9e485c3b",
"language": "erlang",
"views": {
"requestAgentUrl": {
"map": "fun({Doc}) ->\n DocType = couch_util:get_value(<<\"type\">>, Doc),\n case DocType of\n undefined -> ok;\n <<\"agent_request\">> ->\n AgentUrl = couch_util:get_value(<<\"agent_url\">>, Doc),\n Workflow = couch_util:get_value(<<\"workflow\">>, Doc),\n Emit([Workflow, AgentUrl], null);\n _ -> ok\n end\nend.",
"reduce": "_count"
}
},
"couchapp": {
"manifest": [
"language",
"views/",
"views/requestAgentUrl/",
"views/requestAgentUrl/map.erl",
"views/requestAgentUrl/reduce.erl"
],
"objects": {},
"signatures": {}
}
}
Here the designdocname is WMStatsErl and the view name is requestAgentUrl. Each view has a map section (and sometimes a reduce section) in the document. The map section stores the map function, which can be written either in JavaScript or in Erlang (the language CouchDB is written in), while the reduce section provides the reduce function used together with the given map, and it may be omitted.
To list all docs in a given view we do:
https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl1/_view/byAgentURL
Or, we can pass a filter to get specific keys:
scurl https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl1/_view/byAgentURL -X POST -H "Content-type: application/json" -d '{"keys": ["vocms0281.cern.ch"]}'
Or, we can view the grouping:
https://cmsweb.cern.ch:8443/couchdb/wmstats/_design/WMStatsErl3/_view/jobsByStatusWorkflow?group=true
The actual document for a given workflow, which we can see from the ReqMgr2 page
https://cmsweb.cern.ch/reqmgr2/fetch?rid=request-pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816
can be fetched directly from CouchDB as follows:
scurl -s -X POST -H "Content-type: application/json" -d '{"keys": ["pdmvserv_task_HIG-RunIISummer20UL17NanoAODv9-11966__v1_T_221220_021143_6816"]}' "https://cmsweb.cern.ch:8443/couchdb/reqmgr_workload_cache/_all_docs?include_docs=True" | jq
The WMStats UI server gets its data from the WMStats cache server. The latter provides the full list of JSON documents as a single list, which leads to a significant payload size.
A single WMStats JSON record is quite large and its data structure can be seen over here. It consists of various attributes describing the current workflow state, e.g. its name, prepID, Tasks, etc. All attributes follow the CamelCase naming convention except AgentJobInfo, which mixes CamelCase with underscore naming, e.g.:
"AgentJobInfo": {
"cmsgwms-submit6.fnal.gov": {
"_id": "cmsgwms-submit6.fnal.gov-cmsunified_ACDC0_task_EXO-RunIISummer20UL18wmLHEGEN-00824__v1_T_220425_125613_9010",
"_rev": "274-4282735881",
"agent_team": "production",
"agent_version": "2.0.2.patch1",
"agent": "WMAgent",
"agent_url": "cmsgwms-submit6.fnal.gov",
"type": "agent_request",
"workflow": "cmsunified_ACDC0_task_EXO-RunIISummer20UL18wmLHEGEN-00824__v1_T_220425_125613_9010",
"status": {
"inQueue": 2
},
...
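As an illustration of how such a record can be consumed, the sketch below sums the per-agent job status counters from the AgentJobInfo sub-document; only the fields visible in the snippet above are assumed, and nested counters are handled defensively:

```python
from collections import Counter

def aggregate_job_status(wmstats_record):
    """Sum the per-agent 'status' counters of a WMStats record into one Counter."""
    totals = Counter()
    for agent_url, agent_data in wmstats_record.get("AgentJobInfo", {}).items():
        for state, count in agent_data.get("status", {}).items():
            if isinstance(count, dict):  # some states may carry nested counters
                totals[state] += sum(count.values())
            else:
                totals[state] += count
    return totals

# e.g. Counter({'inQueue': 2}) for the (truncated) record shown above
```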
The amount of production WMStats data is proportional to the number of active requests (requests not yet archived) in the system, but it can also change drastically with the failure rate, making it hard to project the required cold storage and memory allocation for this service.
The WMStats server REST endpoint requestcache provides all the data cached in memory:
scurl -s https://cmsweb.cern.ch/wmstatsserver/data/requestcache
An example of metrics for this endpoint, with 11k active requests in the system, is:
- it takes 16 seconds to transfer 50.2MB of data over the network, allocating 251MB worth of data on disk (in JSON format); or
- it takes 24 seconds to transfer 6.5MB of data over the network, allocating 21MB worth of data on disk (in GZIP format).
Loading such a large JSON document into RAM requires at least 2-3 times the JSON size, regardless of the underlying language. Therefore, such an HTTP call not only blocks the frontend, but also introduces significant overhead on the client side (WMStats UI). It is therefore desirable to modify the WMStats server to provide the following (a client-side sketch is shown after the list):
- change its requestcache API to return a subset of data using idx/limit parameters, a.k.a. pagination, e.g. return only 10 records
- change the requestcache API to provide the ndjson format in order to support data streaming
- allow gzip encoding in HTTP requests
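A possible client-side view of these changes is sketched below. Note that the idx/limit parameters and the ndjson variant of the requestcache API are proposed features, not existing ones, and the certificate paths are placeholders:

```python
import json
import requests

CERT = ("/path/to/usercert.pem", "/path/to/userkey.pem")  # placeholder paths
BASE = "https://cmsweb.cern.ch/wmstatsserver/data/requestcache"

def fetch_page(idx=0, limit=10):
    """Fetch a single page of records (hypothetical idx/limit pagination)."""
    resp = requests.get(BASE, params={"idx": idx, "limit": limit}, cert=CERT,
                        headers={"Accept": "application/json",
                                 "Accept-Encoding": "gzip"})
    resp.raise_for_status()
    return resp.json()

def stream_records():
    """Yield records one by one (hypothetical ndjson variant of the API)."""
    headers = {"Accept": "application/x-ndjson", "Accept-Encoding": "gzip"}
    with requests.get(BASE, cert=CERT, headers=headers, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                yield json.loads(line)
```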
Also, the WMStats UI server needs to show aggregated data such as job, event and lumi progress, failure rate, etc. These metrics are calculated via JavaScript functions residing in the WMStats UI server (i.e. executed in the client's browser), see the WMStats/_attachments/js/Views area for all functions. Such calculations are performed over and over on the WMStats UI page, which leads to additional latency (growing with the number of existing workflows in the system). It would be more appropriate to shift this functionality to the WMStats cache server, or to introduce appropriate views within CouchDB and cache them in the WMStats cache server. This would allow a lightweight implementation of the WMStats UI server, which would only fetch and display the data.
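As a sketch of what such a server-side aggregation could look like, the function below computes a simple failure rate from the per-agent job status counters of a cached record. The name of the failed-jobs counter is an assumption for illustration; the real aggregation rules currently live in the WMStats UI JavaScript code:

```python
def failure_rate(wmstats_record):
    """Compute a naive failure rate from the per-agent job status counters."""
    failed = 0
    total = 0
    for agent_data in wmstats_record.get("AgentJobInfo", {}).values():
        for state, count in agent_data.get("status", {}).items():
            n = sum(count.values()) if isinstance(count, dict) else count
            total += n
            if state == "failure":  # assumed name of the failed-jobs counter
                failed += n
    return float(failed) / total if total else 0.0
```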
You may find the full proposal for a new implementation of the WMStats UI server over here.