Consuming polly packages #14

povilasv · 2021-06-15T08:53:15Z

Hey,

Maintainer and creator of a couple of mixins here, have some questions around polly packages.

I'm interested in the consumer use cases here:

How would one add a team label to all alerts?
How would one delete an alert from a polly package?
How would one add your own alert?

I think these should be trivial for consumers to do, so I'm interested in what the CUE code for these would look like. Is it easier to do compared to jsonnet?

sdboyer (Maintainer) · 2021-06-17T02:27:48Z

This is an awesome question. So awesome that i've already written a thousand words over the past two days. I plan to finish tomorrow, before i head out on vacation for a week :)

sdboyer (Maintainer) · 2021-06-25T10:38:00Z

Sorry, didn't manage to finish before vacation! But, fair to say that ruminating on it and typing it all from the passenger seat of the car in bug-ridden wildernesses of Michigan (thanks to my wife for putting up with that!) has made it a richer response. Here we go!

I'm interested in what the CUE code for these would look like?

Fantastic question, thank you! Don't have a totally concrete answer yet. Here's what's clear so far.

To start, a note: because we're trying to draw a clear line between what defines the pop, vs. how the pop is consumed (including what tooling you use to consume it), the answer is always going to be "it depends on what tool you want to use to consume it." Of course, that's a useless answer, and we plan to make at least some simple consumption tools.

With that in mind, i'll talk about each case in the context of four consumption tools that are relatively easy to imagine:

A simplistic CLI tool (intended for testing and quick checks, not production devops workflows) that basically just shoots objects at HTTP APIs
Through an existing, jsonnet-centric workflow (Then consumed by e.g. Tanka or ArgoCD)
Through a hypothetical CUE-driven toolchain (Consumed by future CUE support in Tanka or ArgoCD, or something like Dagger)
Via Terraform

I'll elaborate on what's involved in each approach as we go through each question. Most of the complexity is not in the operations you're asking about, but rather in how the tool logically connects to the pop in order to flow through such choices from the user.

How would one add your own alert?

There's an important answer to start with: it's out of scope. Polly isn't trying to be an exhaustive source of every alert you'd want - nor were mixins before them. As such, they can't possibly be the only way you're defining alerts - so add another with one of the other mechanisms that necessarily exist. Nothin' to do with Polly.

But this answer evades what i suspect is the spirit of your question: "pops, and mixins before them, define a logical grouping of alerts related to a single system. Prometheus necessarily groups alerts, and it seems tidiest to me that i would keep all the alerts for a particular system logically grouped together, whether they originate from the upstream pop or are something i've added as the consumer of the pop." So, i do appreciate why that's not very satisfying.

So, while control for this is out of scope for anything in the polly schema, it's also the goal that the schema at least not make it difficult. The relevant issue there would be #16, where i raise the concern about separating the definitions of rules/alerts from the groups that Prometheus expects to contain them. Certainly the scuemata for alerts will be for individual items, and we'll probably want a map of individual rules/alerts for what's in PollyPackage itself, which itself is contained in a map keyed by group name. (That'd replace this pseudocode.) There's a different idea i want to explore for group name, but it would derail this discussion too much.

That will have implications for the responsibilities placed on consumers: because Prometheus itself expects rules/alerts to arrive as a list of named groups, each containing a list of rules/alerts, it will be incumbent on the consumer to translate from polly's data structure to what Prometheus expects. (Though we may provide helpers in the polly repo that would be useful for CUE-centric approaches; more on this below.)
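
As a rough illustration of that translation, here is a minimal sketch in plain CUE. It assumes a pop-style layout of group name -> alert name -> alert body; the field names `popAlerts` and `groups`, and the alerts themselves, are made up for this example and are not taken from the Polly schema.

```cue
// Hypothetical pop-style layout: group name -> alert name -> alert body.
popAlerts: {
	nameOfGroup: {
		HighErrorRate: {
			expr:  "rate(http_errors_total[5m]) > 0.1"
			"for": "10m"
			labels: severity: "page"
		}
		InstanceDown: {
			expr:  "up == 0"
			"for": "5m"
			labels: severity: "warn"
		}
	}
}

// Shape a Prometheus rule file expects: a list of groups, each with a name
// and a list of rules. Each rule gets its map key back as the "alert" field,
// and the rest of the body is embedded unchanged.
groups: [for groupName, alerts in popAlerts {
	name: groupName
	rules: [for alertName, body in alerts {
		alert: alertName
		body
	}]
}]
```

Keying alerts by name on the pop side is what makes later operations (adding, omitting, labelling a specific alert) addressable by path; the list form only needs to exist at the very end, when handing things to Prometheus.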

Let's look at each of the four consumer approaches. (This first question is going to be pretty huge, because there's a lot of back-context to fill.)

Simplistic CLI

Almost certainly unsupported. Defining whole new objects on the fly will clearly be well out of scope for such a tool. I could imagine pointing the CLI to an arbitrary YAML file with alert definitions, in addition to providing it the name of a pop to install, but that really feels like scope creep.

Jsonnet

We've kicked around a few ways of enabling this. Figuring out this path is high priority on the roadmap - it amounts to backwards compatibility for existing mixin users. All possible approaches go through a mechanism for translating pops into jsonnet on the fly. The big trick with that will be translating package-level parameters into the standard jsonnet parameters _config map, as well as the corresponding references to those parameters distributed throughout the pop. We'll also likely need to "compile out" any signals defined in the pop.

That should, at least for most cases, be sufficient to map the structure of a pop to how existing jsonnet code expects a mixin to be shaped. (That probably implies that the automated translation will need to do the aforementioned map->list transform). And the more precise and constrained we are about what's allowed to be expressed in pops, the easier and more general that translation will be.

In building out that automated translation process, our goal will be that existing jsonnet code for consuming mixins will work, as-is. There's nothing obvious right now suggesting that can't be the case, though I could fill a short novel with the details i/we don't yet know, so take that with a grain of salt. For now, though, i'd imagine that whatever existing jsonnet mechanism you have that adds in an alert to the rule group should continue to work. (Or, along the preferable path, by adding more alerts to a different group.)

I don't know exactly how well jb continues to work here for package management of the CUE pop, as i'm not familiar with jb's assumptions about having jsonnet files in dependency packages. It could be an ideal place to put automated translation - as in, vendor contains the results of CUE->jsonnet translation.

CUE

There's two fundamental ways i can readily imagine consuming a pop from CUE:

As you would with any other CUE: importing the CUE package directly in your infra/CUE files, relying on CUE package management (still a WIP) to do all the usual dependency management-y things.
By writing a string reference to the pop's URI

The first looks the most like the jsonnet workflow. The actual CUE code of the pop you're working with is loaded into memory alongside your CUE code by the cue runtime, simply by following import statements. In this approach, it's at least plausible that the vanilla cue binary could render the objects in the pop to JSON/YAML, at which point effectively any other deployment tool can pick it up. Your CUE code may look a bit like this:

import "github.com/org/system/pop"

// Pull the upstream pop def into local field
popadd: pop.pop 

// Body of the alert you want to add in the curly braces
popadd: prometheusAlerts: v0: nameOfGroup: nameForAlert: { }

(It's now up to the rest of the system to decide how the popadd object makes it out to the actual deployment mechanism)

In the second approach, we're operating on a string reference to the URI of the pop, rather than the actual CUE code. That means the cue binary won't be useful, because it doesn't (and shouldn't) know how to resolve that kind of domain-specific reference. Instead, you're writing CUE code that's intended to be consumed by the binary of some CUE-based framework - like dagger, or the mini-framework in the grafana-cli cue subcommand - that knows how to deal with such references. (AIUI, supporting such frameworks is increasingly the direction CUE is heading.)

Because the actual pop is no longer directly accessible to your CUE code, we have to adopt a different approach. Think of this as the difference between directly, imperatively mutating an object, and defining a deferred action/closure that will mutate such an object later. (To CUE, there is no difference between these - it's all unification, and CUE is order-independent.)

We can create helpers for these, and they probably belong in the polly repo for convenience - a bit like a stdlib. I won't go into implementation details of such helpers, but the code you write would look a bit like this:

import (
    "github.com/pollypkg/polly/ref"
    "github.com/pollypkg/polly/util/prometheus"
)

// NOTE: It's up to the framework to define how 'popadd' gets matched to the actual
// pop. That's a separate, interesting topic, but answers from just what's here (uri)
// aren't hard to imagine.
popadd: ref.PopRef & {
    uri: "popregistry.io/org/system" // URI to where the pop lives in a registry
    version: "v1.0.0" // Version of the pop, as known to the registry
    // NOTE: This omits versioning of the alert objects themselves, but scuemata makes all that tractable
    transform: [prometheus.AddAlert & {
        name: "nameOfAlert"
        group: "nameOfGroup"
        alert: { } // Body of the alert you want to add
    }]
}

Note that it is entirely down to having schema for the objects in Polly - and being able to enumerate the set of those objects - that any of this is possible, and that different CUE frameworks can interop with the same helpers. Also, this kind of formulation is what i think was originally imagined with the term "mixin" - a standalone, referenceable declaration that adds components to other objects.
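
To make the "deferred action" idea a little more concrete, here is a self-contained sketch. Everything in it is an assumption for illustration: `#PopRef` and `#AddAlert` are hypothetical stand-ins for the imported helpers in the snippet above, `resolvedPop` stands in for whatever a framework would load after resolving the URI, and the `op` field is an invented discriminator. It only shows that, once the pop is in hand, applying the transform can reduce to ordinary unification.

```cue
// Hypothetical helper schemas; real ones may look quite different.
#AddAlert: {
	op:    "addAlert" // invented discriminator a framework could switch on
	name:  string
	group: string
	alert: {...} // body of the alert, left open here
}

#PopRef: {
	uri:     string
	version: string
	transform: [...#AddAlert]
}

// The user-written, standalone declaration.
popadd: #PopRef & {
	uri:     "popregistry.io/org/system"
	version: "v1.0.0"
	transform: [#AddAlert & {
		name:  "nameOfAlert"
		group: "nameOfGroup"
		alert: {expr: "rate(http_errors_total[5m]) > 0.1"}
	}]
}

// Stand-in for the pop a framework would fetch for popadd.uri/version.
resolvedPop: prometheusAlerts: v0: nameOfGroup: ExistingAlert: {expr: "up == 0"}

// Applying the transforms: each one contributes a field at the path it
// names, and unification merges that with what the pop already defines.
result: resolvedPop & {
	for t in popadd.transform {
		prometheusAlerts: v0: (t.group): (t.name): t.alert
	}
}
```

Here `result` ends up holding both `ExistingAlert` and `nameOfAlert` under `nameOfGroup`. In a real framework the resolution step would happen outside CUE; only the final merge is shown.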

There's two key properties we get out of this (particularly vis-a-vis jsonnet): provenance and composition/decoupling.

While Prometheus/Alertmanager isn't going to know or care about where the alert object that ultimately lands on it came from, the nature of CUE guarantees that whatever assembles the final object from both the upstream pop and this deferred injector is able to trace back the values from the final object to the exact files and line numbers on which those values were declared. It can chase these declarations across an arbitrarily complex number of injectors, in any possible arrangement of files - and that information could easily be made available when the alert fires.

The benefits of the compositional aspect of this - where, instead of the pseudo-inheritance of the first approach (and jsonnet), we define these standalone injector objects that can be recombined later - may be slightly less obvious. But consider what's possible: when we're released from the constraints of needing to "import jsonnet, override, re-import, re-override" in various logical layers, we can create repo/filesystem arrangements that are actually easy, even intuitive to maintain. The cost of such composition has historically been indecipherable provenance of final objects - but we already covered that 🙂

Terraform

I'm gonna gloss over a lot of the potential complexity we face with Terraform, mostly because it's going to be deeply intertwined with what we do about parameters, and that's another significant discussion item i need to get written up. Let's just handwave and broadly imagine that we've created a Terraform provider which allows us to define a polly resource that starts with the URL for a polly package, and then exposes:

The parameters of the pop
The ability to add/remove/modify objects defined in the polly spec

For working on just a single pop with things like adding/removing an alert, this is reasonably straightforward: everything's going to be easily contained in a single terraform module, and it can be the same one responsible for deploying the system under observation. All the same parameters are necessarily going to be available, as it's all the same module. So, basically: adding another alert is just adding another stanza in your Prometheus module.

What is more complicated is that your e.g. Grafana tf module is not going to be the same as your Prometheus tf module. Clearly, we can make something work so that only one type of resource (Grafana dashboards, Prom alerts/rules) actually gets handled by a given module. Now, i can easily see that creating some pretty nasty coordination requirements across a sea of tf modules (are all of your uses of "popregistry.io/org/system" the same version across all modules?), though it should still all be possible. I'm not enough of a Terraform expert to know what the proper next steps are.

My coworker @pkolyvas has gone more into depth prototyping these internally at GL recently; i'll see about getting some of the handwavy snippets in this direction posted here.

How would one delete an alert from polly package?

This is more straightforward than adding an alert, as it has none of the weirdness around "why does this alert need to be grouped with the ones from the pop?" There's some particular alert in the pop and we don't want it. For discussion, let's call it UnwantedAlert.

There's two basic ways we can conceive of "deletion" of an alert - or really, any object type. The first is removing the actual alert object; the second is retaining the object, but disabling it in some fashion. Let's refer to these as "removing" and "disabling," respectively.

Disabling could be preferable to removal in general because of provenance - if UnwantedAlert is simply absent from the end product, it may be quite difficult to know what caused its removal, particularly if that removal could have come from multiple places, by multiple mechanisms. If it's still present, though, it's much easier - at least in CUE-centric toolchains - to know exactly where that choice was made.

On the flipside, disabling turns the notion of existence from a simple binary (the alert either exists, or not) to a gradation (exists, disabled, not exists). When we get into dependency analysis - e.g., "Prometheus rule X must be installed in order for the data depended upon by alert Y to exist" - such gradations can have combinatorially explosive effects on the complexity of the task.

Simplistic CLI

Alert removal is easy to imagine in a CLI tool.

Once we have the map of alerts per group, rather than the list, it's trivial to imagine how this works: a parameter on the CLI tool that allows the user to indicate a particular alert should not be used:

$ pollyctl install 'popregistry.io/org/system@v1.0.0' --omit 'prometheusAlerts.nameOfGroup.UnwantedAlert'

Some thought will need to be given around delete vs. disable. Delete semantics are universal - there's an object in the pop, and i don't want it in my infra, so filter it out from the set of objects being installed/applied. Disabling semantics are trickier, because they're necessarily specific to the type of config object - that object must have some property that can be flipped, which its executing system will ultimately interpret as "ignore me."

Jsonnet

As with the addition of an alert, alert removal is something that we hope will be viable through whatever mechanisms you use today, assuming we're able to get automated translation to work well. If the disabling path is possible for the particular type of config object, then it'll of course be possible to set any such values with jsonnet.

CUE

Removing fields is actually one of the rougher things in CUE. It's not impossible, but it runs against the language grain; see this issue. It's complicated enough that it would be pretty difficult to construct a trivial example of that first CUE approach (direct import/jsonnet-ish).
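
For a sense of why this runs against the grain: CUE values can only be narrowed by unification, so a field cannot be deleted from an existing struct. A common workaround is to build a new struct with a comprehension that skips the unwanted entries. The sketch below assumes inlined stand-in data (`upstream`) rather than a real imported pop, and the `omit` and `pruned` names are made up for the example.

```cue
import "list"

// Stand-in for an imported pop; alert names and expressions are illustrative.
upstream: prometheusAlerts: v0: nameOfGroup: {
	UnwantedAlert: {expr: "vector(1)"}
	WantedAlert: {expr: "up == 0"}
}

// Names of alerts to leave out.
omit: ["UnwantedAlert"]

// A pruned copy: every alert from upstream whose name is not in omit.
// Note that upstream itself is untouched; consumers must read from pruned.
pruned: prometheusAlerts: v0: nameOfGroup: {
	for name, alert in upstream.prometheusAlerts.v0.nameOfGroup if !list.Contains(omit, name) {
		(name): alert
	}
}
```

This works for one known path, but every level that contains something to drop needs its own comprehension, and anything else in the pop has to be copied across explicitly, which is part of why a general helper is more attractive than hand-writing it.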

The second, compositional approach where we offer helpers would be more straightforward for the user (even if, under the hood, we have to do the kind of ugliness in that issue). Something like this:

import (
    "github.com/pollypkg/polly/ref"
    "github.com/pollypkg/polly/util/prometheus"
)

poprm: ref.PopRef & {
    uri: "popregistry.io/org/system"
    version: "v1.0.0"
    transform: [prometheus.RemoveAlert & {
        name: "UnwantedAlert"
        group: "nameOfGroup"
    }]
}

Something like this approach could also work for flipping on a flag to disable the object. That may also merit a different treatment, though - flipping such a flag is more akin to setting any arbitrary value on a config object than manipulating the set that contains the object. That's a reflection of my earlier point about the complexity inherent in having "gradations of existence."

Terraform

Given the previously-described terraform provider, the disable path is clear - it's just another parameter on an alert that can be set by the consuming module author. I've never written a provider, but i suspect the removal path is also feasible - the provider would just have to create some new type of resource that, if defined by the user, effectively removes the alert from the set that Terraform considers.

How would one add a team label to all alerts?

The goal here is to ensure the presence of a k/v string pair within a set of objects at the same path, within a particular pop.

Simplistic CLI

Supporting this kind of operation in an intentionally simplistic CLI is probably out of scope, as this kind of operation - applying a specific logical operation across a set of values from stdin-based input - would either require an explosion of possible arguments to the binary, or some sort of querying DSL. CUE would be the simplest DSL to use, and at that point, just use the CUE-based framework.

Jsonnet

Same refrain as before: if we assume that we can automatically translate the pop to jsonnet, then all your existing logic for doing this should still work.

CUE

In the inheritance-based approach, i think this would work - i haven't smashed two templates (the one here, and the one the Polly spec defines) together before. But it's all still unification, so i don't see why not:

import "github.com/org/system/pop"

// Pull the upstream pop def into local field
popapply: pop.pop

// This is a CUE template, which unifies this struct with all structs
// (alerts, per Polly spec) in that position.
popapply: prometheusAlerts: v0: nameOfGroup: [string]: {
    labels: team: "myteam"
}
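
A self-contained version of the same idea, which can be evaluated without any imports, might look like the following. The `upstream` struct is an inlined stand-in for the imported pop, and its alerts are invented for the example.

```cue
// Stand-in for the imported pop.
upstream: prometheusAlerts: v0: nameOfGroup: {
	HighErrorRate: {
		expr: "rate(http_errors_total[5m]) > 0.1"
		labels: severity: "page"
	}
	InstanceDown: {expr: "up == 0"}
}

popapply: upstream

// Pattern constraint: unified with every alert in nameOfGroup, whether or
// not that alert already has a labels struct.
popapply: prometheusAlerts: v0: nameOfGroup: [string]: {
	labels: team: "myteam"
}
```

One consequence worth knowing: if an upstream alert already sets `labels: team` to a different value, unification reports a conflict rather than silently overriding it.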

The compositional approach would probably look pretty similar (though again, helpers would probably be worth creating):

import (
    "github.com/pollypkg/polly/ref"
)

popapply: ref.PopRef & {
    uri: "popregistry.io/org/system"
    version: "v1.0.0"
    apply: prometheusAlerts: v0: nameOfGroup: [string]: {
        labels: team: "myteam"
    }
}

The framework merely needs to unify the apply field with any pops that match the criteria given in popapply, and the same effect is achieved.

Terraform

Terraform definitely couldn't do this before HCL's looping control structures were introduced (v0.12.0). Prior to that, there was no native way of expressing "for all" - you'd have to enumerate each alert resource to which you wanted to add the team label.

I think it can do it now, though i haven't used those capabilities myself, so i'm not entirely sure. I imagine, though, that the approach would be to loop over the set of resources in the desired alert group and ensure the appropriate team: "myteam" parameter is set.

Addendum

I want to expand this question to additionally include a harder one - this same operation, but across multiple pops. This is worth considering because i think it illustrates a kind of scaling that Polly+CUE-centric toolchains will uniquely enable.

There are three distinct but functionally related goals to consider, here:

The need to configure and deploy a particular system
The need to configure and deploy the polly package for observing that system, depending on the version of that system
The need to apply rules/constraints/policies across the pops for that system, and others

Our high-level vision for Polly explicitly talks about the first two, but the third is also key in any real org, where even spinning up a basic stack for an early startup often means provisioning a pile of software (Hi, k8s). The ability to precisely and maintainably work across the entire set of pops in use is prerequisite to getting everything you can from your Obs at not just a team level, but for the org as a whole. So, quickly revisiting in each of the four toolchains...

All that's required for the compositional CUE-based approach based on pops and snippets of pop-modding CUE logic is that, when the deployment system goes to deploy some particular system, it a) knows the pop that's associated with that system and b) can enumerate all of the snippets defined in your infra. It can then apply a matching algorithm to decide which of those snippets apply and perform composition for those that do - while still retaining provenance. The simplistic matcher example would be URI-based, alluded to above, but i talked a bit on the mailing list about how something more like CSS is conceivable. This affords the flexibility to use essentially any filesystem layout for your infra-as-code - so you can maximize for ergonomics.
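
A toy version of the URI-based matcher can be written in plain CUE, assuming the pops have already been resolved into data. All names here (`pops`, `snippets`, `match`, `apply`, `composed`) are hypothetical, and prefix matching is just one simple choice of matching rule.

```cue
import "strings"

// Hypothetical: pops the deployment system has resolved, keyed by URI.
pops: {
	"popregistry.io/org/system": prometheusAlerts: v0: groupA: AlertOne: {expr: "up == 0"}
	"popregistry.io/other/thing": prometheusAlerts: v0: groupB: AlertTwo: {expr: "vector(1)"}
}

// Hypothetical policy snippets, which could live anywhere in the repo.
// Each applies to every pop whose URI starts with its match prefix.
snippets: [{
	match: "popregistry.io/org/"
	apply: prometheusAlerts: v0: [string]: [string]: labels: team: "myteam"
}, {
	match: "popregistry.io/"
	apply: prometheusAlerts: v0: [string]: [string]: labels: env: "prod"
}]

// Composition: each pop is unified with the apply block of every matching
// snippet. The order of snippets does not affect the result.
composed: {
	for uri, pop in pops {
		(uri): {
			pop
			for s in snippets if strings.HasPrefix(uri, s.match) {
				s.apply
			}
		}
	}
}
```

With this data, the alert under `popregistry.io/org/system` picks up both the `team` and `env` labels, while the one under `popregistry.io/other/thing` picks up only `env`. Two snippets that set the same label to different values on the same pop would surface as a unification conflict.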

The simplistic CLI couldn't even handle the case you described, so it's got no prayer for this expanded one.

For both Jsonnet and Terraform, modifying the pop is tightly coupled with consuming it. The only place where it's really conceivable to make modifications to the pop is at the point of consumption. That's perfectly workable for the first two cases, as we've seen in the examples throughout this post, but also trying to enable the third gets quite a bit more challenging.

This problem is most obvious in Terraform. We've discussed how it can serve cases 1 & 2 in a single module, but once we try to share decision-making over the pop between that module and some external policy-setter, problems arise. Terraform really struggles with anything beyond basic parameter-passing between modules. It's why there are scant few reusable Terraform providers/modules with transitive dependencies. It seems somewhere between impossible and unmaintainably complex to try to use module variables/parameters to control things like "apply a team label across all alerts in the pop in this module." To achieve "policy" across multiple modules, you'd probably need to write some kind of codegen tool that spits out variables.tf files per module in order to achieve the desired effect - and that process is opaque, so provenance is gone.

Jsonnet doesn't have the same multi-object compositional friction points as Terraform (AIUI). But its central operation is the (provenance-destroying) merge, and the natural association with inheritance hierarchies - import the pop, merge/override, merge/override. The linearity of this path makes it difficult to imagine how policy could be effectively mixed in. I can't say definitively that a matching system similar to the CUE one outlined above couldn't be created, but i suspect it would trip over the difficulty of introspecting incomplete, unevaluated jsonnet to automate crucial decisions (e.g. is there a conflict between this snippet and some jsonnet-pop representation?)

So, finally...

Is it easier to do compared to jsonnet?

It's kinda apples-and-oranges (and subjective), so i'm not really sure how meaningful a comparison i can make. But, in general, there tend to be fewer characters you have to put in a file when overriding something than you do when composing pieces together. That's not necessarily the right measure for difficulty, but still, my guess is that the jsonnet approach will seem easier, at least for a while. And that's of course gonna be the case for people who are already familiar with jsonnet.

My belief, though, is that this is the path that's certainly most scalable, and will in the long and medium term be easiest for people, because they'll be able to reason about the effects of their choices. I have quite a few colleagues at Grafana Labs who are filled with trepidation whenever they go to modify at least certain parts of our jsonnet, because it has grown into an unwieldy beast, and even experts in jsonnet and the codebase have difficulty understanding and predicting the effects of the changes they make.

justinTM · 2021-12-10T21:04:18Z

hey @pkolyvas , @sdboyer mentioned you were working with Terraform+Polly. would you be willing to discuss/share what you've been working on? sounds immediately similar to what we're going to attempt here soon
