Alertmanager: Support optimistic concurrency for configuration updates #9383

Open
stevesg opened this issue Sep 23, 2024 · 0 comments

Problem

Alertmanager has a single configuration file which contains all receivers and the routing tree. There is currently no safe way to have multiple clients reading, modifying and writing the configuration file. This poses a problem when:

  • Disparate teams want to manage a subset of the configuration file.
  • User interfaces (such as Grafana) want to edit a single receiver or route.

It makes sense that we would want to support optimistic concurrency on Alertmanager configurations out of the box (i.e. without requiring an intermediary synchronizing configurations).

Proposal

One option is to support the standard HTTP ETag/If-Match mechanism:

  • GET /alertmanager/api/v1/alerts: Will return an ETag with each response.
  • POST /alertmanager/api/v1/alerts: Will optionally accept an If-Match header.
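
As a rough illustration of the server side, the sketch below derives an ETag from a hash of the configuration and evaluates an incoming If-Match value against it. The function names `compute_etag` and `precondition_holds` are hypothetical, and SHA-256 is an assumption; the proposal only says a hash is compared.

```python
import hashlib
from typing import Optional


def compute_etag(config: bytes) -> str:
    """Return a strong ETag (quoted hex digest) for the stored configuration.

    Assumption: SHA-256 of the raw bytes; any stable content hash would do.
    """
    return '"' + hashlib.sha256(config).hexdigest() + '"'


def precondition_holds(if_match: Optional[str], current_etag: str) -> bool:
    """Decide whether a write may proceed given the If-Match header value.

    Assumes a configuration currently exists, so "*" always matches.
    """
    if if_match is None:
        # Header is optional: no precondition means an unconditional write.
        return True
    candidates = [c.strip() for c in if_match.split(",")]
    if "*" in candidates:
        return True
    # Exact string comparison, so weak validators (W/"...") never match.
    return current_etag in candidates
```

When `precondition_holds` returns False, the handler would respond with 412 and leave the stored configuration untouched.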

The client would:

  • GET the configuration
  • Modify the configuration
  • POST the configuration, with an If-Match header
  • If Alertmanager returns 412, GET again and retry the update
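
The client steps above can be sketched as a small retry loop. To keep the example runnable without a network, the HTTP calls are passed in as callables and exercised against `FakeAlertmanager`, a hypothetical in-memory stand-in; the 200 success status and the counter-based ETag are illustrative assumptions, not the actual API behaviour.

```python
from typing import Callable, Optional, Tuple

PRECONDITION_FAILED = 412


def update_config(
    get: Callable[[], Tuple[str, str]],
    post: Callable[[str, Optional[str]], int],
    modify: Callable[[str], str],
    max_attempts: int = 5,
) -> int:
    """GET, modify, POST with If-Match; on 412 start again from GET."""
    for _ in range(max_attempts):
        config, etag = get()                  # read configuration and its ETag
        status = post(modify(config), etag)   # conditional write
        if status != PRECONDITION_FAILED:
            return status
    raise RuntimeError("gave up after repeated 412 responses")


class FakeAlertmanager:
    """Hypothetical in-memory server used only to exercise the loop."""

    def __init__(self, config: str) -> None:
        self.config = config
        self.version = 0

    def get(self) -> Tuple[str, str]:
        return self.config, f'"{self.version}"'

    def post(self, config: str, if_match: Optional[str]) -> int:
        if if_match is not None and if_match != f'"{self.version}"':
            return PRECONDITION_FAILED
        self.config = config
        self.version += 1
        return 200  # illustrative success status
```

Because `modify` is re-applied to a freshly fetched configuration on every attempt, a change made by another writer in the meantime is preserved rather than overwritten.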

Arguably, the Alertmanager configuration write API could have been a PUT, but that discussion is orthogonal and I don’t see any need to go into it in depth now.

Implementation

This is trivial to implement for GCS and Azure Storage, because they both support If-Match for PUT requests. (It would also be straightforward to find a solution for the Filesystem backend.) However, S3 does not support If-Match for writes, so we’ll have to check it ourselves.

When writing a configuration with an If-Match header provided, we will need to read the current configuration, check that it has the expected content, and write the new content. There is a race condition here, so the check and the upload have to be done under a lock. To do this without introducing external dependencies, we can implement a rudimentary lock in object storage using If-None-Match: *, which S3 now supports in addition to the other providers.

  • Upload a lock object using If-None-Match: *. If the upload fails:
    • Check the lock object timestamp; if it is over some age threshold, delete it *
    • Retry with some back-off
  • Read the current configuration
    • If it does not match the hash passed to If-Match, return 412
  • Upload the new configuration
  • Delete the lock object

* This mechanism is needed to clear stale locks left behind if an Alertmanager crashes between uploading and deleting the lock object.
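
A minimal sketch of this lock-then-compare flow follows. `InMemoryBucket` is a hypothetical stand-in for the bucket client, and its `if_none_match` flag models the “do not overwrite” conditional upload (the standard header for that is `If-None-Match: *`). The `.lock` suffix, SHA-256 hash, thresholds and back-off values are all assumptions chosen for illustration.

```python
import hashlib
import time


class PreconditionFailed(Exception):
    """Raised when a do-not-overwrite upload finds an existing object."""


class InMemoryBucket:
    """Hypothetical object store; maps name -> (data, modified time)."""

    def __init__(self):
        self._objects = {}

    def upload(self, name, data, if_none_match=False):
        if if_none_match and name in self._objects:
            raise PreconditionFailed(name)
        self._objects[name] = (data, time.time())

    def get(self, name):
        entry = self._objects.get(name)
        return None if entry is None else entry[0]

    def modified_at(self, name):
        entry = self._objects.get(name)
        return None if entry is None else entry[1]

    def delete(self, name):
        self._objects.pop(name, None)


def acquire_lock(bucket, lock_name, stale_after=60.0, attempts=5, backoff=0.1):
    """Create the lock object, removing it first if it looks stale."""
    for attempt in range(attempts):
        try:
            bucket.upload(lock_name, b"", if_none_match=True)
            return
        except PreconditionFailed:
            modified = bucket.modified_at(lock_name)
            if modified is not None and time.time() - modified > stale_after:
                bucket.delete(lock_name)  # stale lock from a crashed writer
            else:
                time.sleep(backoff * 2 ** attempt)  # exponential back-off
    raise TimeoutError("could not acquire " + lock_name)


def conditional_write(bucket, name, new_data, expected_hash, **lock_opts):
    """Return True if written, False if the hash check failed (caller sends 412)."""
    lock_name = name + ".lock"
    acquire_lock(bucket, lock_name, **lock_opts)
    try:
        current = bucket.get(name) or b""
        if hashlib.sha256(current).hexdigest() != expected_hash:
            return False
        bucket.upload(name, new_data)
        return True
    finally:
        bucket.delete(lock_name)  # always release the lock
```

Note that stale-lock deletion is itself not atomic in this sketch: two writers could both judge the lock stale, and one could delete a lock the other has just re-created. A real implementation would need to decide whether that window is acceptable.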

This implementation can be achieved with minimal changes to the object storage code; we only need a mechanism to signal “do not overwrite” when calling Upload on the bucket client. The performance and other overheads of this solution are not a concern, because configurations are uploaded infrequently (in the worst case every few minutes while being actively iterated on; after that a configuration might be unchanged for days, weeks or longer).

Iterative Improvements

  • Use ETag values from object storage instead of computing our own hash (saves downloading the existing configuration in full)
  • Use If-Match on object storage providers if available (hopefully S3 supports it soon).