
RFC - Pipeline Component Telemetry #11406

Open
djaglowski wants to merge 6 commits into main
Conversation

djaglowski (Member):

This PR adds an RFC for normalized telemetry across all pipeline components. See #11343.


codecov bot commented Oct 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.14%. Comparing base (68f0264) to head (35c82a4).
Report is 5 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main   #11406      +/-   ##
==========================================
- Coverage   92.15%   92.14%   -0.02%
==========================================
  Files         432      432
  Lines       20291    20296       +5
==========================================
+ Hits        18700    18702       +2
- Misses       1228     1231       +3
  Partials      363      363
```


djaglowski marked this pull request as ready for review on October 10, 2024
djaglowski requested a review from a team as a code owner on October 10, 2024
djaglowski added the Skip Changelog (PRs that do not require a CHANGELOG.md entry) and Skip Contrib Tests labels on Oct 10, 2024
codeboten (Contributor) left a comment:

Thanks for opening this as an RFC @djaglowski!


- `otel.component.kind`: `connector`
- `otel.component.id`: The component ID
- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, `metrics->logs`, `metrics->metrics`, etc, **OR `ALL`**
Contributor:

In the case of a logs->metrics connector, I would expect this otel.signal to be set to logs for the incoming telemetry and metrics for the outgoing telemetry. Are there benefits to including the translation in the attribute?

djaglowski (Member, Author):

If we annotate the incoming and outgoing counts with separate attributes, then connectors such as count or otlpjson will report a single data stream that aggregates across multiple instances. I believe this could lead to problems very similar to the one described here, where users need to be able to differentiate between instances in order to understand a problem. I think this is fundamentally a question of whether or not we should represent component instances to users, or whether we're trying to aggregate and present them in some unified way. My opinion is that we have tried to gloss over this distinction for a long time but the nuances matter in some cases and we're not doing users a favor by hiding the distinction.


A span should be recorded for each execution of a processor or connector. The instrumentation layers adjacent to these components can start and end the span as appropriate.
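
A minimal Go sketch of how such an adjacent instrumentation layer might start and end a span around a logs processor call; the `tracedConsumer` type, tracer wiring, and span name are illustrative assumptions rather than the actual implementation:

```go
package obsconsumer

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// tracedConsumer sits between two pipeline components and records one span
// per execution of the wrapped component's ConsumeLogs call.
type tracedConsumer struct {
	next   consumer.Logs // the wrapped processor or connector
	tracer trace.Tracer
}

func (c *tracedConsumer) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	ctx, span := c.tracer.Start(ctx, "otelcol/processor/ConsumeLogs") // illustrative span name
	defer span.End()
	if err := c.next.ConsumeLogs(ctx, ld); err != nil {
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

func (c *tracedConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}
```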

### Logs
Contributor:

Would it be useful to capture details about logs that automatically report the status (i.e. started) of components in this RFC? I'm wondering if we should ensure that those logs include the attributes listed earlier in this document.

djaglowski (Member, Author):

I'm not sure the mechanism I've described is able to do this. Also, since component status is still unstable, I'd prefer not to take that as a dependency here. My goal with this RFC is not to describe all standard telemetry for pipeline components, but to describe what can be done with the proposed mechanism.

Member:

Agreed that this can be out of scope for now


The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and at each point where a component consumes data from another component. In terms of the component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently.
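
As a rough sketch of this mechanism, each layer could be a `consumer.Logs` wrapper like the following; the `obsConsumer` name, field layout, and counter wiring are assumptions for illustration, not the actual implementation. Two such wrappers back to back on an edge, one carrying the producer's identifying attributes and one the consumer's, describe both sides of the handoff.

```go
package obsconsumer

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// obsConsumer is one instrumentation layer on a pipeline edge. The edge is
// built with two of these: one whose attrs identify the producing component
// (recording an "outgoing" counter) and one whose attrs identify the
// consuming component (recording an "incoming" counter).
type obsConsumer struct {
	next  consumer.Logs
	items metric.Int64Counter  // e.g. otelcol_component_incoming_items or ..._outgoing_items
	attrs []attribute.KeyValue // otel.component.kind, otel.component.id, otel.signal
}

func (c *obsConsumer) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	err := c.next.ConsumeLogs(ctx, ld)
	// Count the items that crossed this edge, ascribed to the single component
	// instance identified by attrs (outcome handling is omitted in this sketch).
	c.items.Add(ctx, int64(ld.LogRecordCount()), metric.WithAttributes(c.attrs...))
	return err
}

func (c *obsConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}
```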

### Attributes
Member:

Talk about the "scope" as well.

djaglowski (Member, Author):

I've added language about the instrumentation scope.
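
As a sketch of one possible wiring, assuming the Go metric API's `WithInstrumentationAttributes` option and a hypothetical `newScopedMeter` helper, the identifying attributes could live on the instrumentation scope rather than on every data point:

```go
package obsconsumer

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// newScopedMeter is a hypothetical helper: it returns a Meter whose
// instrumentation scope carries the component-identifying attributes, so the
// individual measurements do not need to repeat them.
func newScopedMeter(mp metric.MeterProvider, kind, id, signal string) metric.Meter {
	return mp.Meter(
		"go.opentelemetry.io/collector/service", // assumed scope name
		metric.WithInstrumentationAttributes(
			attribute.String("otel.component.kind", kind),
			attribute.String("otel.component.id", id),
			attribute.String("otel.signal", signal),
		),
	)
}
```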

Comment on lines 56 to 59
The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing", from the perspective of the component to which the telemetry is ascribed.

1. Incoming measurements are attributed to the component which is _consuming_ the data.
2. Outgoing measurements are attributed to the component which is _producing_ the data.
Member:

Challenge for this: do we really need both? If every component records this, the outgoing count of component X will be equal to the incoming count of the next component Y.

andrzej-stencel (Member) commented Oct 11, 2024:

I was also considering this. Are there situations where the outgoing metric of component A is not equal to the incoming metric of component B? I think this is the same information. Does splitting this into two different metrics help in usage?

I was considering a "one metric per graph edge" approach, as described in the examples below.

The advantage of the "one metric per edge" approach is that with sufficiently elaborate pipelines, the number of metrics is cut in half (asymptotically).

Is there information that cannot be expressed with this "one metric per edge" approach? Are there other disadvantages, e.g. usability?

Example 1: Simple pipeline with Filter processor
```yaml
exporters:
  debug:
processors:
  filter:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
receivers:
  filelog:
    include:
      - logs.json
service:
  pipelines:
    logs:
      exporters:
        - debug
      processors:
        - filter
      receivers:
        - filelog
```

Metrics according to this proposal:

```
otelcol_component_incoming_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 2
otelcol_component_outgoing_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 2
otelcol_component_incoming_items{otel.component.kind="processor",otel.component.id="filter",otel.signal="logs"} 2
otelcol_component_outgoing_items{otel.component.kind="processor",otel.component.id="filter",otel.signal="logs"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
```

One metric per graph edge:

```
otelcol_component[_incoming]_items{otel[.destination].component.kind="receiver",otel[.destination].component.id="filelog",otel.signal="logs"} 2
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="processor",otel.destination.component.id="filter",otel.signal="logs"} 2
otelcol_component_items{otel.source.component.kind="processor",otel.source.component.id="filter",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="logs"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="logs"} 1
```
Example 2: Pipeline with connector
```yaml
connectors:
  otlpjson:
exporters:
  debug:
receivers:
  filelog:
    include:
      - otlp-log.json
      - otlp-metric.json
      - otlp-span.json
service:
  pipelines:
    logs/input:
      exporters: [otlpjson]
      receivers: [filelog]
    logs/otlp:
      exporters: [debug]
      receivers: [otlpjson]
    metrics/otlp:
      exporters: [debug]
      receivers: [otlpjson]
    traces/otlp:
      exporters: [debug]
      receivers: [otlpjson]
```

This proposal:

```
otelcol_component_incoming_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 3
otelcol_component_outgoing_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 3

otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->logs"} 3
otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->metrics"} 3
otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->traces"} 3
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->logs"} 1
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->metrics"} 1
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->traces"} 1

otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="metrics"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="metrics"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="traces"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="traces"} 1
```

One metric per edge:

```
otelcol_component[_incoming]_items{otel[.destination].component.kind="receiver",otel[.destination].component.id="filelog",otel.signal="logs"} 3

otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="logs"} 3
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="metrics"} 3
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="traces"} 3

otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="logs"} 1
otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="metrics"} 1
otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="traces"} 1

otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="logs"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="metrics"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="traces"} 1
```

A small note: the OTLP/JSON connector's behavior is currently actually different, see open-telemetry/opentelemetry-collector-contrib#35738. I've described the metrics as they would look once that issue is resolved (the connector does not emit empty batches).

djaglowski (Member, Author) commented Oct 11, 2024:

> Do we really need both? If every component records this, the outgoing count of component X will be equal to the incoming count of the next component Y.

This is only true for linear pipelines. Any fanout or merge of data streams breaks this. e.g. Receiver A emits 10 items, receiver B emits 20, processor C consumes 30.

> The advantage of the "one metric per edge" approach is that with sufficiently elaborate pipelines, the number of metrics is cut in half (asymptotically).

Philosophically, I think users should reasonably expect that they can observe an object in isolation to get a relatively complete picture of how it is behaving. Offloading half of the picture for the sake of reducing data volume is a premature optimization in my opinion. In any case, if we support in some way the ability to disable individual metrics, then users can just disable one or the other to opt into exactly this same optimization.

More concretely, say we measure only incoming values. Some things that immediately become difficult for users:

  1. How much data is my filter processor discarding? I can find the incoming value easily, but in order to answer the question I also have to figure out which component(s) it is sending data to, query for their incoming values, aggregate those values, then compare to the incoming. I'm sure this is easier in some backends than others but I don't think users will often write queries for this - they will just accept a lack of visibility. The query is even more difficult if you want it to be resilient to changes in the configuration.
  2. How much data is my receiver emitting? Same convoluted answer as above.
  3. In a linear pipeline, what is the net effect of all my processors? Same convoluted answer as above (vs just comparing outgoing from the receiver to incoming of the exporter).

The same problems exist if you only measure outgoing. The problems are even worse if we actually pin metrics to edges by including attributes from both components.

1. Incoming measurements are attributed to the component which is _consuming_ the data.
2. Outgoing measurements are attributed to the component which is _producing_ the data.

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the function call returned an error. Outgoing measurements will be recorded with `outcome` as `failure` when the next consumer returns an error, and `success` otherwise. Likewise, incoming measurements will be recorded with `outcome` as `failure` when the component itself returns an error, and `success` otherwise.
Member:

Which function call?

djaglowski (Member, Author):

For incoming, it is the call from the previous component to the described component's ConsumeX function. For outgoing, it is the call from the described component to the next component's ConsumeX function.

I'll clarify in the doc too.
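
A small sketch of that rule, using a hypothetical `recordWithOutcome` helper; the layer observing the relevant `ConsumeX` call derives the `outcome` value from the error it returns:

```go
package obsconsumer

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordWithOutcome adds a measurement with outcome=success or outcome=failure
// based on the error returned by the relevant ConsumeX call: the call into the
// described component for "incoming", or the call into the next consumer for
// "outgoing".
func recordWithOutcome(ctx context.Context, counter metric.Int64Counter, count int64, err error, attrs []attribute.KeyValue) {
	outcome := "success"
	if err != nil {
		outcome = "failure"
	}
	counter.Add(ctx, count,
		metric.WithAttributes(append(attrs, attribute.String("outcome", outcome))...))
}
```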

djaglowski (Member, Author):

I've added clarification in the doc.

Comment on lines +79 to +92
```yaml
otelcol_component_incoming_size:
  enabled: false
  description: Size of items passed to the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_size:
  enabled: false
  description: Size of items emitted from the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
```
Member:

This is something that triggers another question for me:

  • Controlling whether a metric is enabled and recorded happens at SDK creation. We will need a way for components to communicate this (for these metrics as well) to the service telemetry.

djaglowski (Member, Author):

I am not up to speed on the latest design but wouldn't it be intuitive that component.TelemetrySettings is able to answer the question of whether or not a metric is enabled?

Contributor:

> Controlling whether a metric is enabled and recorded happens at SDK creation. We will need a way for components to communicate this (for these metrics as well) to the service telemetry.

This makes me think that the components will need to be able to return views (we've talked about this a few times).

The alternative here would be to only support enabling/disabling of metrics via view configuration.

bogdandrutu (Member) commented Oct 13, 2024:

> This makes me think that the components will need to be able to return views (we've talked about this a few times).

Yes, that would be great. I think it would be the Factories though, since we need this before pipelines are created.
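
For reference, a sketch of the views-only alternative using the Go SDK directly (assuming a recent `go.opentelemetry.io/otel/sdk/metric` where `AggregationDrop` is available); the metric name is taken from the RFC, and the rest is a generic SDK example rather than collector code:

```go
package main

import (
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider drops the proposed otelcol_component_outgoing_size metric
// via a view, without the component needing to know whether it is enabled.
// A real setup would also register one or more readers.
func newMeterProvider() *sdkmetric.MeterProvider {
	dropOutgoingSize := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "otelcol_component_outgoing_size"},
		sdkmetric.Stream{Aggregation: sdkmetric.AggregationDrop{}},
	)
	return sdkmetric.NewMeterProvider(sdkmetric.WithView(dropOutgoingSize))
}
```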

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, incoming measurements will be recorded with `outcome` as `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, outgoing measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise.

```yaml
otelcol_component_incoming_items:
```
Member:

I would call them "consumer", since consumers are about moving data; components only start/stop.

djaglowski (Member, Author):

otelcol_component_consumed_items and otelcol_component_produced_items?

Member:

> otelcol_component_consumed_items and otelcol_component_produced_items?

I like these names

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, incoming measurements will be recorded with `outcome` as `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, outgoing measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise.

```yaml
otelcol_component_incoming_items:
```
Member:

There is an advantage to having the otel.component.kind in the name of the metric vs as a label:

  • The rule of thumb for having a label vs putting it in the name is whether the sum across that label makes sense or not. I think it actually doesn't make that much sense here.
  • The mechanism of disabling (you can still disable the calculation, but not the recording, since you don't know which labels will be recorded) and avoiding computation is at the "metric" level.

I think it may not be bad to have per-component-kind metrics. I would not necessarily go as far as per component kind and type, but that can be another option.

djaglowski (Member, Author):

> The rule of thumb for having a label vs putting it in the name is whether the sum across that label makes sense or not. I think it actually doesn't make that much sense here.

I am on the fence about this. I agree the case is weak but aggregation along this dimension might at least give a sense of scale.

What I mean is that if you consider each component as doing some non-trivial amount of work for a given amount of data, you can aggregate to understand relatively how much work the collector is doing internally. e.g. Say you are considering some changes to a configuration and compare v(n) to v(n+1). If you aggregate across all dimensions of "incoming" you may see that even with the same inputs the collector has 10x as many items passed between components.

Either way, I'm not necessarily against having metrics per kind. Some consequences though:

  • To have consistent attributes with other signals, we should still apply the otel.signal attribute, even though this is redundant with the metric name.
  • It's more difficult for users to answer questions like "which of the components is producing the most items?" because they have to join across 3 metrics before sorting.

> The mechanism of disabling (you can still disable the calculation, but not the recording, since you don't know which labels will be recorded) and avoiding computation is at the "metric" level.

I'm not sure I understand. Are you pointing out that users would be forced to enable/disable the metric for all kinds, but maybe they should be able to control it for each kind independently? e.g. Only record items produced by receivers and consumed by exporters?

djaglowski changed the title from "RFC - Auto-instrumentation of pipeline components" to "RFC - Pipeline Component Telemetry" on Oct 16, 2024
djaglowski (Member, Author):

Based on some offline feedback, I've broadened the scope of the RFC, while simultaneously clarifying that it is intended to evolve as we identify additional standards.
