Monitoring of CMSWEB services with Prometheus AlertManager

Alan Malta Rodrigues edited this page Sep 22, 2021 · 4 revisions

Central services running in the CMSWEB Kubernetes cluster are monitored with Prometheus, either via standard exporters (for example, for process monitoring or for CouchDB) or via custom CMS monitoring scripts, such as the liveness probe k8s service.

Prometheus and these exporters collect node and service metrics, which are then made available in a centralized database (possibly Elasticsearch). Those metrics are continuously evaluated against the service rules defined in the CMSKubernetes repository. Further information has been provided by the Monitoring team HERE.

Updating rules for a given service

Whenever we want to update the Prometheus/AlertManager rules and alerts, the changes must be made in the CMSKubernetes repository. Two files need to be considered:

  • your_service_name.rules: contains the rule definition, that is, the conditions that trigger an alert, the alert definition itself, and the time interval over which the rule needs to be evaluated
  • your_service_name.test: contains the unit tests for your rules
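To make the contents of the rule file more concrete, here is a minimal sketch in the standard Prometheus rule-file format. It is not taken from the CMSKubernetes repository: the group name, the alert name `YourServiceDown`, the `job` label value, the `severity` label and the annotation text are all hypothetical placeholders, and the actual CMSWEB rules may use different labels and conventions.

```yaml
# Hypothetical content of your_service_name.rules (illustration only)
groups:
  - name: your_service_name        # assumed group name
    rules:
      - alert: YourServiceDown     # assumed alert name
        # Condition that triggers the alert: the scrape target is down
        expr: up{job="your_service_name"} == 0
        # The condition must hold for this long before the alert fires
        for: 5m
        labels:
          severity: high           # assumed label, used for alert routing
          service: your_service_name
        annotations:
          summary: "your_service_name is down"
```

The `expr` field holds the condition, `for` holds the time interval during which the condition must remain true, and `labels` and `annotations` form the alert definition that is passed on to AlertManager.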

Once these changes have been made, we should validate the rule definition and also run it against our unit test file. For that, promtool has been made available in CVMFS. To check our rule definition, we can run:

amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool check rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules 
Checking kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules
  SUCCESS: 4 rules found

and to run the unit tests we have defined:

amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool test rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test 
Unit Testing:  kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test
  SUCCESS
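For reference, a unit test file of the kind passed to `promtool test rules` could look like the sketch below, written in the standard Prometheus unit-test format. It assumes a hypothetical rule file `your_service_name.rules` that defines an alert `YourServiceDown` with the expression `up{job="your_service_name"} == 0`, a `for: 5m` clause, the labels `severity: high` and `service: your_service_name`, and a `summary` annotation. None of these names come from the CMSKubernetes repository.

```yaml
# Hypothetical content of your_service_name.test (illustration only)
rule_files:
  - your_service_name.rules        # assumed rule file, in the same directory

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: the target reports up == 0 for 11 samples (0m to 10m)
    input_series:
      - series: 'up{job="your_service_name", instance="host:8080"}'
        values: '0+0x10'
    alert_rule_test:
      # At 6m the condition has held longer than the assumed "for: 5m"
      - eval_time: 6m
        alertname: YourServiceDown
        exp_alerts:
          - exp_labels:
              severity: high
              service: your_service_name
              job: your_service_name
              instance: "host:8080"
            exp_annotations:
              summary: "your_service_name is down"
```

The expected labels must include both the labels set by the rule and those inherited from the input series, and the expected annotations must match the rule's annotations exactly, otherwise promtool reports a failure.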

Once everything looks good on our side, we open a pull request against the CMSKubernetes repository and ask the HTTP team to deploy these changes to CMSWEB.
