Monitoring of CMSWEB services with Prometheus AlertManager

Alan Malta Rodrigues edited this page Oct 6, 2021 · 4 revisions

Central services running in the CMSWEB Kubernetes cluster are monitored with Prometheus, either via standard exporters (for example, for process monitoring or for CouchDB) or via custom CMS monitoring scripts, such as the liveness probe k8s service.

Prometheus and these exporters fetch node and service metrics, which are then made available in a centralized database (Elasticsearch?). Those metrics are continuously evaluated against the service rules defined in the CMSKubernetes repository. Further information has been provided by the Monitoring team HERE.

UPDATE: some extra rules, for instance for the Central CouchDB services, can also be found in THIS GitLab CMSMonitoring repository.

Updating rules for a given service

Whenever we want to update the Prometheus/AlertManager rules and alerts, the changes must be made in the CMSKubernetes repository. Two files need to be considered:

  • your_service_name.rules: contains the rule definition, the conditions to trigger an alert, the alert definition itself, and a time interval in which the rule needs to be evaluated
  • your_service_name.test: a unit test for your rules
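As a rough illustration of what such a pair of files can look like, here is a minimal sketch in the standard Prometheus rule-file and promtool unit-test formats. The service name `example`, the alert name, the `up` expression, the `severity` label and the instance address are all hypothetical placeholders and are not taken from the CMSKubernetes repository; real CMSWEB rules will likely use different metrics and labels.

```yaml
# example.rules (hypothetical service, for illustration only)
groups:
  - name: example
    rules:
      - alert: ExampleServiceDown
        # Condition that triggers the alert
        expr: up{job="example"} == 0
        # The condition must hold for this long before the alert fires
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Example service is down"
```

A matching unit test feeds synthetic samples into the rule and states which alerts are expected at a given time:

```yaml
# example.test (hypothetical unit test for example.rules)
rule_files:
  - example.rules
evaluation_interval: 1m
tests:
  - interval: 1m
    # Synthetic input: the target reports "down" for seven consecutive samples
    input_series:
      - series: 'up{job="example", instance="host:8080"}'
        values: '0 0 0 0 0 0 0'
    alert_rule_test:
      # After 6 minutes the 5m "for" clause has been satisfied
      - eval_time: 6m
        alertname: ExampleServiceDown
        exp_alerts:
          - exp_labels:
              severity: high
              job: example
              instance: host:8080
            exp_annotations:
              summary: "Example service is down"
```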

Once these changes have been made, we should check the rule definition and also run it against our unit test file. For that, promtool has been made available and deployed on CVMFS. To check our rule definition, we can run:

amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool check rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules 
Checking kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules
  SUCCESS: 4 rules found

and to run the unit tests we have defined:

amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool test rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test 
Unit Testing:  kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test
  SUCCESS
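If both steps are run often, they could be wrapped in a small helper. The following is only a sketch: the function name `validate_rules` and the `PROMTOOL` and `RULES_DIR` overrides are assumptions made for illustration, while the promtool path, the rules directory and the two subcommands are the ones used in the commands above.

```shell
# Hypothetical helper; defaults mirror the commands shown above.
PROMTOOL="${PROMTOOL:-/cvmfs/cms.cern.ch/cmsmon/promtool}"
RULES_DIR="${RULES_DIR:-kubernetes/cmsweb/monitoring/prometheus/rules}"

# validate_rules SERVICE
# Runs the syntax check and then the unit tests for one service.
# Returns non-zero as soon as either step fails.
validate_rules() {
    service="$1"
    if [ -z "$service" ]; then
        echo "usage: validate_rules SERVICE" >&2
        return 2
    fi
    "$PROMTOOL" check rules "$RULES_DIR/$service.rules" || return 1
    "$PROMTOOL" test rules "$RULES_DIR/$service.test" || return 1
}
```

After sourcing this from the top of a CMSKubernetes checkout, `validate_rules reqmgr2` would run the same two commands as above and stop at the first failure.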

Once everything looks good on our side, we open a pull request against the CMSKubernetes repository and ask the HTTP team to deploy these changes to CMSWEB.