Initialize metrics with labels #2162

lambdanis · 2024-02-28T13:11:43Z

So far metrics with labels were registered but not initialized at startup. This leads to unpredictable resources usage and limits certain registry operations, e.g. linting or autogenerating docs.

This PR initializes metrics with known labels. It required explicitly defining known labels, so there are lots of commits doing just that - defining types for label values. Let me know if you want me to squash some of these commits. I also used InitMetrics functions as a notebook and left some comments for the future about metrics naming - hope that's ok.

A few remarks about the impact and limitations of this approach:

Initializing metrics at startup means increased initial resources usage. However, it also means that metrics volume is more predictable and can be planned for - that's good.
I changed some (not many) label values to be more useful. These are technically breaking changes. Details are documented in relevant commit messages.
Some metrics have labels that depend on the environment: process metadata (namespace, workload, pod, binary), policy name, hook point. These metrics are not initialized.
Some metrics have known labels, but are registered as custom collectors. These are also not initialized.

Metrics with known labels values are initialized to 0 on startup.

This helps to ensure stable resources usage and metrics queries. This also involves changes in several metrics labels:
* error_type label on tetragon_handler_errors_total metric is either "unknown_opcode" or "event_handler_failed" instead of the Go type of the error
* event_type label on tetragon_event_cache*_errors_total metrics is one of the values defined in Tetragon API (tetragon.EventType) instead of the Go type of the event
* error label on tetragon_event_cache_errors_total metric is "nil_process_pid"
* error label is removed from tetragon_policyfilter_metrics_total metric

willfindlay · 2024-02-28T20:21:47Z

pkg/grpc/tracing/stats.go

+	// NOTES:
+	// * Rename process_loader_stats metric (to e.g. process_loader_events_total) and count label (to e.g. event)?


Does it make sense to link GH issues here?

yeah, I think so, just I'm not 100% sure which of these points are worth implementing and which not. Many of these are technically breaking changes, but also many metrics are probably used only by Tetragon developers so 🤷‍♀️ I need one more pass over the metrics and maybe another pair of eyes, then I'll open issues for improvements.

willfindlay · 2024-02-28T20:23:15Z

pkg/metrics/errormetrics/errormetrics.go

+func (e ErrorType) String() string {
+	return errorTypeLabelValues[e]
+}


Should we have a fallback here for missing types to avoid panics?

It won't panic, in case e is not in errorTypeLabelValues it will return an empty string (zero value).

pkg/api/ops/ops.go

willfindlay · 2024-02-28T20:27:45Z

pkg/metrics/eventcachemetrics/eventcachemetrics.go

+}
+
+func (t CacheEntryType) String() string {
+	return cacheEntryTypeLabelValues[t]


Same comment as in errormetrics.

willfindlay · 2024-02-28T20:31:35Z

pkg/metrics/kprobemetrics/kprobemetrics.go

+}
+
+func (t MergeErrorType) String() string {
+	return mergeErrorTypeLabelValues[t]


Same comment as errormetrics.

willfindlay · 2024-02-28T20:32:19Z

pkg/metrics/policyfiltermetrics/policyfiltermetrics.go

+}
+
+func (s Subsys) String() string {
+	return subsysLabelValues[s]


Same comment as errormetrics.

willfindlay · 2024-02-28T20:32:44Z

pkg/metrics/watchermetrics/watchermetrics.go

+}
+
+func (w Watcher) String() string {
+	return watcherTypeLabelValues[w]


Same comment as errormetrics.

willfindlay · 2024-02-28T20:33:28Z

LGTM, I just left a few nits/comments. Feel free to take them or leave them.

This commit is a preparation for further metrics improvements. There are no exciting changes here, only adding some comments about metrics naming to track further work. Signed-off-by: Anna Kapuscinska <[email protected]>

The label values are now retrieved via ErrorType.String() method. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

This is done so that we can conveniently iterate over all known opcodes. The OpCode.String() method also reads from this map. Signed-off-by: Anna Kapuscinska <[email protected]>

Signed-off-by: Anna Kapuscinska <[email protected]>

Before, error_type label on tetragon_handler_errors_total metric was set to the Go type of the error that occurred. This is suboptimal for a few reasons: * Go type of the error isn't necessarily relevant for the operator. In particular, `errors.errorString` is rather opaque. * Such a label is unstable - developers can simply rename an internal type. * Label values are unbound. This makes it harder to estimate the metric cardinality and initialize it. This commit changes the error_type label to take one of the values defined in the errormetrics package. Technically it is a breaking change. The metrics is marked as "for internal use only", so it should be okay. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_errors_total with all possible errors. Initialize tetragon_handler_errors_total with: * known handler opcodes (excluding test) and "event_handler_failed" error * 0 ("unknown") opcode and "unknown_opcode" error Signed-off-by: Anna Kapuscinska <[email protected]>

Introduce CacheEntryType type and constants representing possible values of entry_type label on tetragon_event_cache_retries_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

Instead of using the Go type of the event as event_type label on event cache metrics, use the value defined in the Tetragon API. This is technically a breaking change. However, it makes the labels more stable and consistent across metrics. Signed-off-by: Anna Kapuscinska <[email protected]>

Add event_type label to tetragon_event_cache_errors_total, so that errors can be aggregated by event type. Signed-off-by: Anna Kapuscinska <[email protected]>

Introduce CacheError type and constants representing possible values of error label on tetragon_event_cache_errors_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_event_cache_retries_total with all possible entry types. Initialize tetragon_event_cache_<entry_type>_errors_total with all possible event types. Initialize tetragon_event_cache_errors_total with all (one) possible errors and event types. Signed-off-by: Anna Kapuscinska <[email protected]>

This is done so that we can conveniently iterate over all common flags. The DecodeCommonFlags() function also reads from this map. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_flags_total with all possible flags. Signed-off-by: Anna Kapuscinska <[email protected]>

Introduce MergeErrorType type and constants representing possible values of curr_type and prev_type labels on tetragon_generic_kprobe_merge_errors_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_msg_op_total and tetragon_handling_latency with all possible opcodes. Signed-off-by: Anna Kapuscinska <[email protected]>

Introduce Subsys type and constants representing possible values of subsys label on tetragon_policyfilter_metrics_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

Introduce Operation type and constants representing possible values of op label on tetragon_policyfilter_metrics_total metric. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

Remove error label from tetragon_policyfilter_metrics_total metric. This label wasn't adding any useful information, but its values were unbound, making it risky from a cardinality perspective. This also removes direct dependency on github.com/pkg/errors package. Technically it is a breaking change. The metrics is marked as "for internal use only", so it should be okay. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_policyfilter_metrics_total with all possible combinations of subsys and operation. Signed-off-by: Anna Kapuscinska <[email protected]>

Metrics defined in the processexecmetrics package are not used anywhere, so let's remove them. They had exec_id as a label - this is risky from a cardinality perspective, so they shouldn't be used in this form anyway. Signed-off-by: Anna Kapuscinska <[email protected]>

Introduce Watcher type and constants representing possible values of watcher label on watcher_<errors|events>_total metrics. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_watcher_errors_total and tetragon_watcher_events_total metrics with all (one) possible watchers and (also one) errors. Signed-off-by: Anna Kapuscinska <[email protected]>

The label values are now retrieved via DataEventOp.String() method. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_data_events_total with all possible events. Initialize tetragon_data_event_size with all possible operations. It's a histogram with quite large number of buckets (20), but only two possible labels values, so it should be okay. Signed-off-by: Anna Kapuscinska <[email protected]>

Initialize tetragon_process_loader_stats with all possible event types. Signed-off-by: Anna Kapuscinska <[email protected]>

Signed-off-by: Anna Kapuscinska <[email protected]>

netlify · 2024-03-01T12:48:22Z

✅ Deploy Preview for tetragon ready!

Name	Link
🔨 Latest commit	`b24e3f1`
🔍 Latest deploy log	https://app.netlify.com/sites/tetragon/deploys/65e1ce6cfc0c780008fc26e3
😎 Deploy Preview	https://deploy-preview-2162--tetragon.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

lambdanis · 2024-03-01T16:32:34Z

I added a release note to the PR description, covering all changes in metrics & labels. I'm merging this so that #2164 is unblocked.

pkg/metrics/policyfiltermetrics/policyfiltermetrics.go

lambdanis added area/metrics Related to prometheus metrics release-note/minor This PR introduces a minor user-visible change labels Feb 28, 2024

lambdanis requested a review from a team as a code owner February 28, 2024 13:11

lambdanis requested a review from tpapagian February 28, 2024 13:11

lambdanis force-pushed the pr/lambdanis/metrics-init-labels branch from 0c08d05 to 5302f27 Compare February 28, 2024 14:03

This was referenced Feb 28, 2024

Generate metrics reference docs #2124

Closed

Generate metrics reference docs #2164

Merged

willfindlay approved these changes Feb 28, 2024

View reviewed changes

lambdanis force-pushed the pr/lambdanis/metrics-init-labels branch from f70a3ee to cebc21f Compare February 29, 2024 23:20

lambdanis added 20 commits March 1, 2024 12:42

metrics: Add comments about metrics naming

269a1cc

This commit is a preparation for further metrics improvements. There are no exciting changes here, only adding some comments about metrics naming to track further work. Signed-off-by: Anna Kapuscinska <[email protected]>

errormetrics: Define error types as integers not strings

af88716

The label values are now retrieved via ErrorType.String() method. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

errormetrics: Use ops.OpCode type instead of basic ints

d2724f8

This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

api/ops: Define OpCodeStrings map with opcode string values

187592b

This is done so that we can conveniently iterate over all known opcodes. The OpCode.String() method also reads from this map. Signed-off-by: Anna Kapuscinska <[email protected]>

api/ops: Add missing OpCode values

fea557c

Signed-off-by: Anna Kapuscinska <[email protected]>

eventcachemetrics: Add event_type label to errors metric

decdb63

Add event_type label to tetragon_event_cache_errors_total, so that errors can be aggregated by event type. Signed-off-by: Anna Kapuscinska <[email protected]>

reader/exec: Define FlagStrings map with flag string values

2e6e219

This is done so that we can conveniently iterate over all common flags. The DecodeCommonFlags() function also reads from this map. Signed-off-by: Anna Kapuscinska <[email protected]>

eventmetrics: Initialize metrics with labels

546bf72

Initialize tetragon_flags_total with all possible flags. Signed-off-by: Anna Kapuscinska <[email protected]>

opcodemetrics: Use ops.OpCode type instead of basic ints

d90fc84

This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

opcodemetrics: Initialize metrics with labels

efe5cd7

Initialize tetragon_msg_op_total and tetragon_handling_latency with all possible opcodes. Signed-off-by: Anna Kapuscinska <[email protected]>

lambdanis added 8 commits March 1, 2024 12:47

policyfiltermetrics: Initialize metrics with labels

37739a5

Initialize tetragon_policyfilter_metrics_total with all possible combinations of subsys and operation. Signed-off-by: Anna Kapuscinska <[email protected]>

watchermetrics: Initialize metrics with labels

4f29532

Initialize tetragon_watcher_errors_total and tetragon_watcher_events_total metrics with all (one) possible watchers and (also one) errors. Signed-off-by: Anna Kapuscinska <[email protected]>

observer: Define op as integers not strings

5cdc9b4

The label values are now retrieved via DataEventOp.String() method. This is done to improve how metrics with labels are initialized and maintained. Signed-off-by: Anna Kapuscinska <[email protected]>

tracing: Initialize metrics with labels

f5f9629

Initialize tetragon_process_loader_stats with all possible event types. Signed-off-by: Anna Kapuscinska <[email protected]>

api/ops: Make OpCode.String() more informative for unknown opcodes

b24e3f1

Signed-off-by: Anna Kapuscinska <[email protected]>

lambdanis force-pushed the pr/lambdanis/metrics-init-labels branch from cebc21f to b24e3f1 Compare March 1, 2024 12:47

lambdanis merged commit cdb7946 into main Mar 1, 2024
38 checks passed

lambdanis deleted the pr/lambdanis/metrics-init-labels branch March 1, 2024 16:32

kkourt reviewed Apr 1, 2024

View reviewed changes

pkg/metrics/policyfiltermetrics/policyfiltermetrics.go Show resolved Hide resolved

kkourt mentioned this pull request Apr 1, 2024

policyfilter: Apply Policy Only to Specific Containers in a Pod #2231

Merged

lambdanis mentioned this pull request Apr 25, 2024

Improve metrics library to enforce best practices #2376

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initialize metrics with labels #2162

Initialize metrics with labels #2162

lambdanis commented Feb 28, 2024 •

edited by kkourt

Loading

willfindlay Feb 28, 2024

lambdanis Feb 29, 2024

willfindlay Feb 28, 2024

lambdanis Feb 29, 2024

willfindlay Feb 28, 2024

willfindlay Feb 28, 2024

willfindlay Feb 28, 2024

willfindlay Feb 28, 2024

willfindlay commented Feb 28, 2024

netlify bot commented Mar 1, 2024

lambdanis commented Mar 1, 2024

		// NOTES:
		// * Rename process_loader_stats metric (to e.g. process_loader_events_total) and count label (to e.g. event)?

Initialize metrics with labels #2162

Initialize metrics with labels #2162

Conversation

lambdanis commented Feb 28, 2024 • edited by kkourt Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

willfindlay commented Feb 28, 2024

netlify bot commented Mar 1, 2024

✅ Deploy Preview for tetragon ready!

lambdanis commented Mar 1, 2024

lambdanis commented Feb 28, 2024 •

edited by kkourt

Loading