Monitoring of CMSWEB services with Prometheus AlertManager
Central services that run in the CMSWEB Kubernetes cluster are monitored with Prometheus, either via standard exporters (for example, for process monitoring or for CouchDB) or via custom CMS monitoring scripts such as the liveness probe k8s service.
Prometheus and these exporters fetch node and service metrics, which are then made available in a centralized database (Elasticsearch?), and those metrics are continuously evaluated against the service rules defined in the CMSKubernetes repository. Further information has been provided by the Monitoring team HERE.
Please check the CMSMonitoring documentation HERE for more information on these alerts, where the rules are stored, and how to check which rules are enforced on the Prometheus server.
Whenever we want to update the Prometheus/AlertManager (AM) based rules and alerts, the changes must be made in the CMSKubernetes repository. There are two files to consider:
- your_service_name.rules: contains the rule definition, the conditions to trigger an alert, the alert definition itself, and the time interval at which the rule needs to be evaluated
- your_service_name.test: a unit test for your rules
Once these changes have been made, we should check the rule definition and also test it with our unit test file. For that, promtool has been made available and deployed in CVMFS. To check our rule definitions, we can run it like this:
amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool check rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules
Checking kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules
SUCCESS: 4 rules found
and to run the unit tests we have defined:
amaltaro@lxplus751:~/CMSKubernetes $ /cvmfs/cms.cern.ch/cmsmon/promtool test rules kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test
Unit Testing: kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.test
SUCCESS
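The two promtool steps above can be wrapped in a small helper so that both are always run together for a given service. This is only a minimal sketch: the function names are hypothetical, and it assumes the promtool path and the rules directory shown in the commands above, with files named after the service (service.rules and service.test).

```python
import subprocess
from pathlib import Path

# Assumption: promtool location, as used in the commands above.
PROMTOOL = "/cvmfs/cms.cern.ch/cmsmon/promtool"
# Assumption: rules directory, relative to the CMSKubernetes checkout.
RULES_DIR = Path("kubernetes/cmsweb/monitoring/prometheus/rules")


def promtool_commands(service, rules_dir=RULES_DIR, promtool=PROMTOOL):
    """Return the two promtool invocations for a service as argv lists."""
    rules_dir = Path(rules_dir)
    return [
        # Step 1: syntax check of the rule definitions.
        [promtool, "check", "rules", str(rules_dir / f"{service}.rules")],
        # Step 2: run the unit tests written for those rules.
        [promtool, "test", "rules", str(rules_dir / f"{service}.test")],
    ]


def validate_service(service, **kwargs):
    """Run both steps in order and stop at the first one that fails."""
    for cmd in promtool_commands(service, **kwargs):
        # promtool exits with a non-zero status when a check or test fails.
        if subprocess.run(cmd).returncode != 0:
            return False
    return True
```

Calling validate_service("reqmgr2") from the top of the CMSKubernetes checkout would run the same two commands shown above and return True only if both succeed.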
Once everything looks good on our side, we make a pull request to the CMSKubernetes repository and ask the HTTP team to deploy these changes to CMSWEB.
In some use cases it is much more useful to fetch all the alerts as uploaded to MONIT in the form of a .json file rather than working with the monitoring page itself (e.g. https://cms-monitoring.cern.ch/alertmanager/#/alerts?receiver=dmwm-admins&filter={service%3D%22ms-rulecleaner%22} ).
For this, any machine behind the CERN firewall can be used with standard tools for making the HTTP calls. Here is an example of fetching all alarms produced by a particular service and sent to a single group:
curl -o MSRulecleanerAlarms.json http://cms-monitoring.cern.ch:30093/api/v2/alerts\?receiver=dmwm-admins\&service=ms-rulecleaner
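Once the file has been downloaded, it can be inspected with a few lines of Python. The sketch below assumes the file holds a JSON array of alert objects, each carrying a labels mapping, which is the shape returned by the AlertManager v2 alerts endpoint. The helper names are hypothetical, and the presence of a severity label is an assumption about how the rules are written.

```python
import json
from collections import Counter


def load_alerts(path):
    """Load the list of alerts saved by the curl command above."""
    with open(path) as fh:
        return json.load(fh)


def filter_by_label(alerts, name, value):
    """Keep only the alerts whose label `name` equals `value`."""
    return [a for a in alerts if a.get("labels", {}).get(name) == value]


def summarize(alerts):
    """Count alerts per (alertname, severity) pair."""
    counts = Counter()
    for alert in alerts:
        labels = alert.get("labels", {})
        # Assumption: rules set a `severity` label; fall back if they do not.
        key = (labels.get("alertname", "unknown"),
               labels.get("severity", "unknown"))
        counts[key] += 1
    return counts
```

For example, summarize(filter_by_label(load_alerts("MSRulecleanerAlarms.json"), "service", "ms-rulecleaner")) gives a per-alert count for that service. Filtering on the client side like this is also a useful cross-check that the server returned only the alerts that were asked for.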