Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initialize metrics with labels #2162

Merged
merged 28 commits into from
Mar 1, 2024
Merged

Conversation

lambdanis
Copy link
Contributor

@lambdanis lambdanis commented Feb 28, 2024

So far metrics with labels were registered but not initialized at startup. This leads to unpredictable resources usage and limits certain registry operations, e.g. linting or autogenerating docs.

This PR initializes metrics with known labels. It required explicitly defining known labels, so there are lots of commits doing just that - defining types for label values. Let me know if you want me to squash some of these commits. I also used InitMetrics functions as a notebook and left some comments for the future about metrics naming - hope that's ok.

A few remarks about the impact and limitations of this approach:

  • Initializing metrics at startup means increased initial resources usage. However, it also means that metrics volume is more predictable and can be planned for - that's good.
  • I changed some (not many) label values to be more useful. These are technically breaking changes. Details are documented in relevant commit messages.
  • Some metrics have labels that depend on the environment: process metadata (namespace, workload, pod, binary), policy name, hook point. These metrics are not initialized.
  • Some metrics have known labels, but are registered as custom collectors. These are also not initialized.
Metrics with known labels values are initialized to 0 on startup.

This helps to ensure stable resources usage and metrics queries. This also involves changes in several metrics labels:
* error_type label on tetragon_handler_errors_total metric is either "unknown_opcode" or "event_handler_failed" instead of the Go type of the error
* event_type label on tetragon_event_cache*_errors_total metrics is one of the values defined in Tetragon API (tetragon.EventType) instead of the Go type of the event
* error label on tetragon_event_cache_errors_total metric is "nil_process_pid"
* error label is removed from tetragon_policyfilter_metrics_total metric

@lambdanis lambdanis added area/metrics Related to prometheus metrics release-note/minor This PR introduces a minor user-visible change labels Feb 28, 2024
@lambdanis lambdanis requested a review from a team as a code owner February 28, 2024 13:11
@lambdanis lambdanis force-pushed the pr/lambdanis/metrics-init-labels branch from 0c08d05 to 5302f27 Compare February 28, 2024 14:03
Comment on lines +23 to +29
// NOTES:
// * Rename process_loader_stats metric (to e.g. process_loader_events_total) and count label (to e.g. event)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to link GH issues here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think so, just I'm not 100% sure which of these points are worth implementing and which not. Many of these are technically breaking changes, but also many metrics are probably used only by Tetragon developers so 🤷‍♀️ I need one more pass over the metrics and maybe another pair of eyes, then I'll open issues for improvements.

Comment on lines +44 to +45
func (e ErrorType) String() string {
return errorTypeLabelValues[e]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a fallback here for missing types to avoid panics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't panic, in case e is not in errorTypeLabelValues it will return an empty string (zero value).

pkg/api/ops/ops.go Outdated Show resolved Hide resolved
}

func (t CacheEntryType) String() string {
return cacheEntryTypeLabelValues[t]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in errormetrics.

}

func (t MergeErrorType) String() string {
return mergeErrorTypeLabelValues[t]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as errormetrics.

}

func (s Subsys) String() string {
return subsysLabelValues[s]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as errormetrics.

}

func (w Watcher) String() string {
return watcherTypeLabelValues[w]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as errormetrics.

@willfindlay
Copy link
Contributor

LGTM, I just left a few nits/comments. Feel free to take them or leave them.

@lambdanis lambdanis force-pushed the pr/lambdanis/metrics-init-labels branch from f70a3ee to cebc21f Compare February 29, 2024 23:20
This commit is a preparation for further metrics improvements. There are no
exciting changes here, only adding some comments about metrics naming to track
further work.

Signed-off-by: Anna Kapuscinska <[email protected]>
The label values are now retrieved via ErrorType.String() method.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
This is done so that we can conveniently iterate over all known opcodes. The
OpCode.String() method also reads from this map.

Signed-off-by: Anna Kapuscinska <[email protected]>
Signed-off-by: Anna Kapuscinska <[email protected]>
Before, error_type label on tetragon_handler_errors_total metric was set to the
Go type of the error that occurred. This is suboptimal for a few reasons:
* Go type of the error isn't necessarily relevant for the operator. In
  particular, `errors.errorString` is rather opaque.
* Such a label is unstable - developers can simply rename an internal type.
* Label values are unbound. This makes it harder to estimate the metric
  cardinality and initialize it.

This commit changes the error_type label to take one of the values defined in
the errormetrics package.

Technically it is a breaking change. The metrics is marked as "for internal use
only", so it should be okay.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_errors_total with all possible errors.

Initialize tetragon_handler_errors_total with:
* known handler opcodes (excluding test) and "event_handler_failed" error
* 0 ("unknown") opcode and "unknown_opcode" error

Signed-off-by: Anna Kapuscinska <[email protected]>
Introduce CacheEntryType type and constants representing possible values of
entry_type label on tetragon_event_cache_retries_total metric.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
Instead of using the Go type of the event as event_type label on event cache
metrics, use the value defined in the Tetragon API.

This is technically a breaking change. However, it makes the labels more stable
and consistent across metrics.

Signed-off-by: Anna Kapuscinska <[email protected]>
Add event_type label to tetragon_event_cache_errors_total, so that errors can
be aggregated by event type.

Signed-off-by: Anna Kapuscinska <[email protected]>
Introduce CacheError type and constants representing possible values of error
label on tetragon_event_cache_errors_total metric.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_event_cache_retries_total with all possible entry types.

Initialize tetragon_event_cache_<entry_type>_errors_total with all possible
event types.

Initialize tetragon_event_cache_errors_total with all (one) possible errors and
event types.

Signed-off-by: Anna Kapuscinska <[email protected]>
This is done so that we can conveniently iterate over all common flags. The
DecodeCommonFlags() function also reads from this map.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_flags_total with all possible flags.

Signed-off-by: Anna Kapuscinska <[email protected]>
Introduce MergeErrorType type and constants representing possible values of
curr_type and prev_type labels on tetragon_generic_kprobe_merge_errors_total
metric.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_msg_op_total and tetragon_handling_latency with all
possible opcodes.

Signed-off-by: Anna Kapuscinska <[email protected]>
Introduce Subsys type and constants representing possible values of subsys label
on tetragon_policyfilter_metrics_total metric.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
Introduce Operation type and constants representing possible values of op label
on tetragon_policyfilter_metrics_total metric.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
Remove error label from tetragon_policyfilter_metrics_total metric. This label
wasn't adding any useful information, but its values were unbound, making it
risky from a cardinality perspective.

This also removes direct dependency on github.com/pkg/errors package.

Technically it is a breaking change. The metrics is marked as "for internal use
only", so it should be okay.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_policyfilter_metrics_total with all possible combinations
of subsys and operation.

Signed-off-by: Anna Kapuscinska <[email protected]>
Metrics defined in the processexecmetrics package are not used anywhere, so
let's remove them. They had exec_id as a label - this is risky from a
cardinality perspective, so they shouldn't be used in this form anyway.

Signed-off-by: Anna Kapuscinska <[email protected]>
Introduce Watcher type and constants representing possible values of watcher
label on watcher_<errors|events>_total metrics.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_watcher_errors_total and tetragon_watcher_events_total
metrics with all (one) possible watchers and (also one) errors.

Signed-off-by: Anna Kapuscinska <[email protected]>
The label values are now retrieved via DataEventOp.String() method.

This is done to improve how metrics with labels are initialized and maintained.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_data_events_total with all possible events.

Initialize tetragon_data_event_size with all possible operations. It's a
histogram with quite large number of buckets (20), but only two possible labels
values, so it should be okay.

Signed-off-by: Anna Kapuscinska <[email protected]>
Initialize tetragon_process_loader_stats with all possible event types.

Signed-off-by: Anna Kapuscinska <[email protected]>
@lambdanis lambdanis force-pushed the pr/lambdanis/metrics-init-labels branch from cebc21f to b24e3f1 Compare March 1, 2024 12:47
Copy link

netlify bot commented Mar 1, 2024

Deploy Preview for tetragon ready!

Name Link
🔨 Latest commit b24e3f1
🔍 Latest deploy log https://app.netlify.com/sites/tetragon/deploys/65e1ce6cfc0c780008fc26e3
😎 Deploy Preview https://deploy-preview-2162--tetragon.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@lambdanis
Copy link
Contributor Author

I added a release note to the PR description, covering all changes in metrics & labels. I'm merging this so that #2164 is unblocked.

@lambdanis lambdanis merged commit cdb7946 into main Mar 1, 2024
38 checks passed
@lambdanis lambdanis deleted the pr/lambdanis/metrics-init-labels branch March 1, 2024 16:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Related to prometheus metrics release-note/minor This PR introduces a minor user-visible change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants