Alertmanager: Support optimistic concurrency for configuration updates #9383

stevesg · 2024-09-23T13:59:32Z

Problem

Alertmanager has a single configuration file which contains all receivers and the routing tree. There is currently no safe way to have multiple clients reading, modifying and writing the configuration file. This poses a problem when:

- Disparate teams want to manage a subset of the configuration file.
- User interfaces (such as Grafana) want to edit a single receiver or route.

It makes sense that we would want to support optimistic concurrency on Alertmanager configurations out of the box (i.e. without requiring an intermediary synchronizing configurations).

Proposal

One option is to support the standard HTTP ETag/If-Match mechanism:

GET /alertmanager/api/v1/alerts: Will return an ETag with each response.
POST /alertmanager/api/v1/alerts: Will optionally accept an If-Match header.

The client would:

GET the configuration
Modify the configuration
POST the configuration, with an If-Match header
If Alertmanager returns 412 (Precondition Failed), GET again and retry the update
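
The client loop above could look roughly like the following sketch. It is deliberately transport-agnostic: `fetch` and `store` are hypothetical callables standing in for the GET and POST requests, and `FakeServer` exists only to make the example runnable. None of these names come from Alertmanager, and the success status used here is an arbitrary choice for the sketch.

```python
from typing import Callable, Tuple


class FakeServer:
    """In-memory stand-in for the configuration API (hypothetical)."""

    def __init__(self, config: str) -> None:
        self.config = config
        self.version = 1

    def get(self) -> Tuple[str, str]:
        # Return the body plus an ETag; here the ETag is a quoted version counter.
        return self.config, f'"{self.version}"'

    def post(self, config: str, if_match: str) -> int:
        # Reject the write with 412 when the caller's ETag is stale.
        if if_match != f'"{self.version}"':
            return 412
        self.config = config
        self.version += 1
        return 201  # assumed success status, for illustration only


def update_config(
    fetch: Callable[[], Tuple[str, str]],
    store: Callable[[str, str], int],
    modify: Callable[[str], str],
    max_attempts: int = 5,
) -> str:
    """Read-modify-write with If-Match, retrying whenever the server answers 412."""
    for _ in range(max_attempts):
        current, etag = fetch()           # GET the configuration
        updated = modify(current)         # modify it locally
        status = store(updated, etag)     # POST with If-Match: <etag>
        if status == 412:
            continue                      # someone else wrote first; start over
        if status >= 400:
            raise RuntimeError(f"update failed with status {status}")
        return updated
    raise RuntimeError("configuration kept changing; giving up")
```

Note that `modify` is re-applied to the freshly fetched configuration on every attempt, so a concurrent change by another client is preserved rather than overwritten.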

Arguably, the Alertmanager configuration write API could have been a PUT, but that discussion is orthogonal and I don’t see any need to go into it in depth now.

Implementation

This is trivial to implement for GCS and Azure Storage, because they both support If-Match for PUT requests. (It would also be straightforward to find a solution for the Filesystem backend.) However, S3 does not support If-Match for writes, so we’ll have to perform the check ourselves.

When writing a configuration with an If-Match header provided, we will need to read the current configuration, check that it has the expected content, and write the new content. There is a race condition here, so the checking and uploading have to be done under a lock. To do this without introducing external dependencies, we can use If-None-Match: * to implement a rudimentary lock using object storage; this is now supported by S3 in addition to the other providers.

Upload a lock object using If-None-Match: *. If the upload fails:
- Check the lock object timestamp; if it’s over some age threshold, delete it *
Retry with some back-off
Read the current configuration
- If it does not match the hash passed to If-Match, return 412
Upload the new configuration
Delete the lock object

* This mechanism is needed to detect stale locks, if an Alertmanager crashes between uploading and deleting the lock object.
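
As a rough illustration of the steps above, here is a minimal sketch against an in-memory bucket. Everything in it is an assumption made for the example: `MemoryBucket`, its `if_none_match` flag (modelling `If-None-Match: *`), the `.lock` suffix, the SHA-256 content hash, and the threshold and back-off values are all hypothetical and do not reflect the actual bucket client.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


class PreconditionFailed(Exception):
    """Raised when a conditional request cannot be satisfied."""


class MemoryBucket:
    """In-memory stand-in for an object store (hypothetical)."""

    def __init__(self) -> None:
        self.objects: Dict[str, Tuple[bytes, float]] = {}

    def upload(self, name: str, data: bytes, if_none_match: bool = False) -> None:
        # if_none_match=True models `If-None-Match: *`: fail if the object exists.
        if if_none_match and name in self.objects:
            raise PreconditionFailed(name)
        self.objects[name] = (data, time.time())

    def get(self, name: str) -> Optional[bytes]:
        entry = self.objects.get(name)
        return entry[0] if entry else None

    def modified(self, name: str) -> Optional[float]:
        entry = self.objects.get(name)
        return entry[1] if entry else None

    def delete(self, name: str) -> None:
        self.objects.pop(name, None)


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def conditional_write(bucket: MemoryBucket, key: str, new: bytes, if_match: str,
                      stale_after: float = 30.0, retries: int = 5,
                      backoff: float = 0.05) -> None:
    lock = key + ".lock"
    for attempt in range(retries):
        try:
            bucket.upload(lock, b"", if_none_match=True)  # acquire the lock
            break
        except PreconditionFailed:
            created = bucket.modified(lock)
            if created is not None and time.time() - created > stale_after:
                bucket.delete(lock)  # stale lock left behind by a crashed writer
            time.sleep(backoff * 2 ** attempt)  # retry with back-off
    else:
        raise TimeoutError("could not acquire the configuration lock")
    try:
        current = bucket.get(key) or b""
        if content_hash(current) != if_match:
            raise PreconditionFailed(key)  # the API layer would map this to 412
        bucket.upload(key, new)
    finally:
        bucket.delete(lock)  # always release the lock
```

The sketch does not address the case where two writers both decide the same lock is stale and one deletes the other's freshly acquired lock; a real implementation would need to consider that.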

This implementation can be achieved with minimal changes to the object storage code; we only need a mechanism to signal “do not overwrite” when calling Upload on the bucket client. The performance and other overheads of this solution are not a concern, because configurations are uploaded infrequently (in the worst case every few minutes while being actively iterated on; after that, a configuration might be unchanged for days, weeks or longer).

Iterative Improvements

Use ETag values from object storage instead of computing our own hash (saves downloading the existing configuration in full)
Use If-Match on object storage providers if available (hopefully S3 supports it soon).

stevesg added the component/alertmanager label Sep 23, 2024

stevesg self-assigned this Sep 24, 2024
