
RFC - Pipeline Component Telemetry #11406

Open
djaglowski wants to merge 6 commits into main
Conversation

djaglowski (Member):

This PR adds an RFC for normalized telemetry across all pipeline components. See #11343.


codecov bot commented Oct 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.14%. Comparing base (68f0264) to head (35c82a4).
Report is 5 commits behind head on main.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main   #11406      +/-   ##
==========================================
- Coverage   92.15%   92.14%   -0.02%
==========================================
  Files         432      432
  Lines       20291    20296       +5
==========================================
+ Hits        18700    18702       +2
- Misses       1228     1231       +3
  Partials      363      363
```


djaglowski marked this pull request as ready for review on October 10, 2024
djaglowski requested a review from a team as a code owner on October 10, 2024
djaglowski added the Skip Changelog (PRs that do not require a CHANGELOG.md entry) and Skip Contrib Tests labels on Oct 10, 2024
codeboten (Contributor) left a comment:

Thanks for opening this as an RFC @djaglowski!


- `otel.component.kind`: `connector`
- `otel.component.id`: The component ID
- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, `metrics->logs`, `metrics->metrics`, etc, **OR `ALL`**
Contributor:

In the case of a logs->metrics connector, I would expect this otel.signal to be set to logs for the incoming telemetry and metrics for the outgoing telemetry. Are there benefits to including the translation in the attribute?

djaglowski (Member, Author):

If we annotate the incoming and outgoing counts with separate attributes, then connectors such as count or otlpjson will report a single data stream that aggregates across multiple instances. I believe this could lead to problems very similar to the one described here, where users need to be able to differentiate between instances in order to understand a problem. I think this is fundamentally a question of whether or not we should represent component instances to users, or whether we're trying to aggregate and present them in some unified way. My opinion is that we have tried to gloss over this distinction for a long time but the nuances matter in some cases and we're not doing users a favor by hiding the distinction.


A span should be recorded for each execution of a processor or connector. The instrumentation layers adjacent to these components can start and end the span as appropriate.
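
A minimal Go sketch of how such an adjacent instrumentation layer might start and end a span around a logs processor call; the `tracedConsumer` type, tracer wiring, and span name are illustrative assumptions rather than the actual implementation:

```go
package obsconsumer

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/trace"
)

// tracedConsumer sits between two pipeline components and records one span
// per execution of the wrapped component's ConsumeLogs call.
type tracedConsumer struct {
	next   consumer.Logs // the wrapped processor or connector
	tracer trace.Tracer
}

func (c *tracedConsumer) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	ctx, span := c.tracer.Start(ctx, "otelcol/processor/ConsumeLogs") // illustrative span name
	defer span.End()
	if err := c.next.ConsumeLogs(ctx, ld); err != nil {
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

func (c *tracedConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}
```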

### Logs
Contributor:

Would it be useful to capture details about logs that automatically report the status (i.e. started) of components in this RFC? I'm wondering if we should ensure that those logs include the attributes listed earlier in this document.

djaglowski (Member, Author):

I'm not sure the mechanism I've described is able to do this. Also, since component status is still unstable, I'd prefer not to take that as a dependency here. My goal with this RFC is not to describe all standard telemetry for pipeline components, but to describe what can be done with the proposed mechanism.

Member:

Agreed that this can be out of scope for now


The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and at each point where a component consumes data from another component. In terms of the component graph, every _edge_ in the graph will have two layers of instrumentation - one for the producing component and one for the consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently.
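
As a rough sketch of this mechanism, each layer could be a `consumer.Logs` wrapper like the following; the `obsConsumer` name, field layout, and counter wiring are assumptions for illustration, not the actual implementation. Two such wrappers back to back on an edge, one carrying the producer's identifying attributes and one the consumer's, describe both sides of the handoff.

```go
package obsconsumer

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// obsConsumer is one instrumentation layer on a pipeline edge. The edge is
// built with two of these: one whose attrs identify the producing component
// (recording an "outgoing" counter) and one whose attrs identify the
// consuming component (recording an "incoming" counter).
type obsConsumer struct {
	next  consumer.Logs
	items metric.Int64Counter  // e.g. otelcol_component_incoming_items or ..._outgoing_items
	attrs []attribute.KeyValue // otel.component.kind, otel.component.id, otel.signal
}

func (c *obsConsumer) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	err := c.next.ConsumeLogs(ctx, ld)
	// Count the items that crossed this edge, ascribed to the single component
	// instance identified by attrs (outcome handling is omitted in this sketch).
	c.items.Add(ctx, int64(ld.LogRecordCount()), metric.WithAttributes(c.attrs...))
	return err
}

func (c *obsConsumer) Capabilities() consumer.Capabilities {
	return c.next.Capabilities()
}
```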

### Attributes
Member:

Talk about the "scope" as well.

djaglowski (Member, Author):

I've added language about the instrumentation scope.
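
As a sketch of one possible wiring, assuming the Go metric API's `WithInstrumentationAttributes` option and a hypothetical `newScopedMeter` helper, the identifying attributes could live on the instrumentation scope rather than on every data point:

```go
package obsconsumer

import (
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// newScopedMeter is a hypothetical helper: it returns a Meter whose
// instrumentation scope carries the component-identifying attributes, so the
// individual measurements do not need to repeat them.
func newScopedMeter(mp metric.MeterProvider, kind, id, signal string) metric.Meter {
	return mp.Meter(
		"go.opentelemetry.io/collector/service", // assumed scope name
		metric.WithInstrumentationAttributes(
			attribute.String("otel.component.kind", kind),
			attribute.String("otel.component.id", id),
			attribute.String("otel.signal", signal),
		),
	)
}
```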

Comment on lines 56 to 59
The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing", from the perspective of the component to which the telemetry is ascribed.

1. Incoming measurements are attributed to the component which is _consuming_ the data.
2. Outgoing measurements are attributed to the component which is _producing_ the data.
Member:

Challenge for this: do we really need both? If every component records this, the outgoing count of component X will be equal to the incoming count of the next component Y.

andrzej-stencel (Member) commented Oct 11, 2024:

I was also considering this. Are there situations where the outgoing metric of component A is not equal to the incoming metric of component B? I think this is the same information. Does splitting this into two different metrics help in usage?

I was considering a "one metric per graph edge" approach, as described in the examples below.

The advantage of the "one metric per edge" approach is that with sufficiently elaborate pipelines, the number of metrics is cut in half (asymptotically).

Is there information that cannot be expressed with this "one metric per edge" approach? Are there other disadvantages, e.g. usability?

Example 1: Simple pipeline with Filter processor
```yaml
exporters:
  debug:
processors:
  filter:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
receivers:
  filelog:
    include:
      - logs.json
service:
  pipelines:
    logs:
      exporters:
        - debug
      processors:
        - filter
      receivers:
        - filelog
```

Metrics according to this proposal:

```
otelcol_component_incoming_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 2
otelcol_component_outgoing_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 2
otelcol_component_incoming_items{otel.component.kind="processor",otel.component.id="filter",otel.signal="logs"} 2
otelcol_component_outgoing_items{otel.component.kind="processor",otel.component.id="filter",otel.signal="logs"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
```

One metric per graph edge:

```
otelcol_component[_incoming]_items{otel[.destination].component.kind="receiver",otel[.destination].component.id="filelog",otel.signal="logs"} 2
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="processor",otel.destination.component.id="filter",otel.signal="logs"} 2
otelcol_component_items{otel.source.component.kind="processor",otel.source.component.id="filter",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="logs"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="logs"} 1
```
Example 2: Pipeline with connector
```yaml
connectors:
  otlpjson:
exporters:
  debug:
receivers:
  filelog:
    include:
      - otlp-log.json
      - otlp-metric.json
      - otlp-span.json
service:
  pipelines:
    logs/input:
      exporters: [otlpjson]
      receivers: [filelog]
    logs/otlp:
      exporters: [debug]
      receivers: [otlpjson]
    metrics/otlp:
      exporters: [debug]
      receivers: [otlpjson]
    traces/otlp:
      exporters: [debug]
      receivers: [otlpjson]
```

This proposal:

```
otelcol_component_incoming_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 3
otelcol_component_outgoing_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 3

otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->logs"} 3
otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->metrics"} 3
otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->traces"} 3
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->logs"} 1
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->metrics"} 1
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->traces"} 1

otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="metrics"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="metrics"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="traces"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="traces"} 1
```

One metric per edge:

```
otelcol_component[_incoming]_items{otel[.destination].component.kind="receiver",otel[.destination].component.id="filelog",otel.signal="logs"} 3

otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="logs"} 3
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="metrics"} 3
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="traces"} 3

otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="logs"} 1
otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="metrics"} 1
otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="traces"} 1

otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="logs"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="metrics"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="traces"} 1
```

A small note: the OTLP/JSON connector's behavior is currently actually different, see open-telemetry/opentelemetry-collector-contrib#35738. I've described the metrics as they would look once that issue is resolved (the connector does not emit empty batches).

djaglowski (Member, Author) commented Oct 11, 2024:

> Do we really need both? If every component records this, the outgoing count of component X will be equal to the incoming count of the next component Y.

This is only true for linear pipelines. Any fanout or merge of data streams breaks this. e.g. Receiver A emits 10 items, receiver B emits 20, processor C consumes 30.

> The advantage of the "one metric per edge" approach is that with sufficiently elaborate pipelines, the number of metrics is cut in half (asymptotically).

Philosophically, I think users should reasonably expect that they can observe an object in isolation to get a relatively complete picture of how it is behaving. Offloading half of the picture for the sake of reducing data volume is a premature optimization in my opinion. In any case, if we support in some way the ability to disable individual metrics, then users can just disable one or the other to opt into exactly this same optimization.

More concretely, say we measure only incoming values. Some things that immediately become difficult for users:

  1. How much data is my filter processor discarding? I can find the incoming value easily, but in order to answer the question I also have to figure out which component(s) it is sending data to, query for their incoming values, aggregate those values, then compare to the incoming. I'm sure this is easier in some backends than others but I don't think users will often write queries for this - they will just accept a lack of visibility. The query is even more difficult if you want it to be resilient to changes in the configuration.
  2. How much data is my receiver emitting? Same convoluted answer as above.
  3. In a linear pipeline, what is the net effect of all my processors? Same convoluted answer as above (vs just comparing outgoing from the receiver to incoming of the exporter).

The same problems exist if you only measure outgoing. The problems are even worse if we actually pin metrics to edges by including attributes from both components.

1. Incoming measurements are attributed to the component which is _consuming_ the data.
2. Outgoing measurements are attributed to the component which is _producing_ the data.

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the function call returned an error. Outgoing measurements will be recorded with `outcome` as `failure` when the next consumer returns an error, and `success` otherwise. Likewise, incoming measurements will be recorded with `outcome` as `failure` when the component itself returns an error, and `success` otherwise.
Member:

Which function call?

djaglowski (Member, Author):

For incoming, it is the call from the previous component to the described component's ConsumeX function. For outgoing, it is the call from the described component to the next component's ConsumeX function.

I'll clarify in the doc too.
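
A small sketch of that rule, using a hypothetical `recordWithOutcome` helper; the layer observing the relevant `ConsumeX` call derives the `outcome` value from the error it returns:

```go
package obsconsumer

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordWithOutcome adds a measurement with outcome=success or outcome=failure
// based on the error returned by the relevant ConsumeX call: the call into the
// described component for "incoming", or the call into the next consumer for
// "outgoing".
func recordWithOutcome(ctx context.Context, counter metric.Int64Counter, count int64, err error, attrs []attribute.KeyValue) {
	outcome := "success"
	if err != nil {
		outcome = "failure"
	}
	counter.Add(ctx, count,
		metric.WithAttributes(append(attrs, attribute.String("outcome", outcome))...))
}
```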

djaglowski (Member, Author):

I've added clarification in the doc.

Comment on lines +79 to +92
```yaml
otelcol_component_incoming_size:
  enabled: false
  description: Size of items passed to the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_size:
  enabled: false
  description: Size of items emitted from the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
```
Member:

This is something that triggers another question for me:

  • Controlling whether a metric is enabled and recorded happens at SDK creation. We will need a way for components to communicate this (for these metrics as well) to the service telemetry.

djaglowski (Member, Author):

I am not up to speed on the latest design but wouldn't it be intuitive that component.TelemetrySettings is able to answer the question of whether or not a metric is enabled?

Contributor:

> Controlling whether a metric is enabled and recorded happens at SDK creation. We will need a way for components to communicate this (for these metrics as well) to the service telemetry.

This makes me think that the components will need to be able to return views (we've talked about this a few times).

The alternative here would be to only support enabling/disabling of metrics via view configuration.

bogdandrutu (Member) commented Oct 13, 2024:

> This makes me think that the components will need to be able to return views (we've talked about this a few times).

Yes, that would be great. I think it would be the Factories though, since we need this before pipelines are created.
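
For reference, a sketch of the views-only alternative using the Go SDK directly (assuming a recent `go.opentelemetry.io/otel/sdk/metric` where `AggregationDrop` is available); the metric name is taken from the RFC, and the rest is a generic SDK example rather than collector code:

```go
package main

import (
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// newMeterProvider drops the proposed otelcol_component_outgoing_size metric
// via a view, without the component needing to know whether it is enabled.
// A real setup would also register one or more readers.
func newMeterProvider() *sdkmetric.MeterProvider {
	dropOutgoingSize := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "otelcol_component_outgoing_size"},
		sdkmetric.Stream{Aggregation: sdkmetric.AggregationDrop{}},
	)
	return sdkmetric.NewMeterProvider(sdkmetric.WithView(dropOutgoingSize))
}
```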

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, incoming measurements will be recorded with `outcome` as `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, outgoing measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise.

```yaml
otelcol_component_incoming_items:
```
Member:

I would call them "consumer", since consumers are about moving data; components only start/stop.

djaglowski (Member, Author):

otelcol_component_consumed_items and otelcol_component_produced_items?

Member:

> otelcol_component_consumed_items and otelcol_component_produced_items?

I like these names

For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the corresponding function call returned an error. Specifically, incoming measurements will be recorded with `outcome` as `failure` when a call from the previous component to the `ConsumeX` function returns an error, and `success` otherwise. Likewise, outgoing measurements will be recorded with `outcome` as `failure` when a call to the next consumer's `ConsumeX` function returns an error, and `success` otherwise.

```yaml
otelcol_component_incoming_items:
```
Member:

There is an advantage to having the otel.component.kind in the name of the metric vs as a label:

  • The rule of thumb for having a label vs putting it in the name is whether the sum across that label makes sense or not. I think it actually doesn't make that much sense here.
  • The mechanism of disabling (you can still disable the calculation, but not the recording, since you don't know which labels will be recorded) and avoiding computation is at the "metric" level.

I think it may not be bad to have per-component-kind metrics. I would not necessarily go as far as per component kind and type, but that can be another option.

djaglowski (Member, Author):

> The rule of thumb for having a label vs putting it in the name is whether the sum across that label makes sense or not. I think it actually doesn't make that much sense here.

I am on the fence about this. I agree the case is weak but aggregation along this dimension might at least give a sense of scale.

What I mean is that if you consider each component as doing some non-trivial amount of work for a given amount of data, you can aggregate to understand relatively how much work the collector is doing internally. e.g. Say you are considering some changes to a configuration and compare v(n) to v(n+1). If you aggregate across all dimensions of "incoming" you may see that even with the same inputs the collector has 10x as many items passed between components.

Either way, I'm not necessarily against having metrics per kind. Some consequences though:

  • To have consistent attributes with other signals, we should still apply the otel.signal attribute, even though this is redundant with the metric name.
  • It's more difficult for users to answer questions like "which of the components is producing the most items?" because they have to join across 3 metrics before sorting.

> The mechanism of disabling (you can still disable the calculation, but not the recording, since you don't know which labels will be recorded) and avoiding computation is at the "metric" level.

I'm not sure I understand. Are you pointing out that users would be forced to enable/disable the metric for all kinds, but maybe they should be able to control it for each kind independently? e.g. Only record items produced by receivers and consumed by exporters?

djaglowski changed the title from "RFC - Auto-instrumentation of pipeline components" to "RFC - Pipeline Component Telemetry" on Oct 16, 2024
djaglowski (Member, Author):

Based on some offline feedback, I've broadened the scope of the RFC, while simultaneously clarifying that it is intended to evolve as we identify additional standards.
