RFC - Pipeline Component Telemetry #11406
118 changes: 118 additions & 0 deletions docs/rfcs/component-universal-telemetry.md
# Auto-Instrumented Component Telemetry

## Motivation

The collector should be observable, and this naturally includes observability of its pipeline components. It is understood that each _type_ of component (`filelog`, `batch`, etc.) may emit telemetry describing its internal workings, and that these internally derived signals may vary greatly based on the concerns and maturity of each component. Nevertheless, the collector should also describe the behavior of components using broadly normalized telemetry. A major challenge in pursuing this is that there must be a clear mechanism by which such telemetry can be automatically captured. Therefore, this RFC is first and foremost a proposal for a _mechanism_. Then, based on what _can_ be captured by this mechanism, the RFC describes specific metrics and logs which can be broadly normalized.

## Goals

1. Articulate a mechanism which enables us to _automatically_ capture telemetry from _all pipeline components_.
2. Define attributes that are (A) specific enough to describe individual component [_instances_](https://github.com/open-telemetry/opentelemetry-collector/issues/10534) and (B) consistent enough for correlation across signals.
3. Define specific metrics for each kind of pipeline component.
4. Define specific logs for all kinds of pipeline component.

### Mechanism

The mechanism of telemetry capture should be _external_ to components. Specifically, we should observe telemetry at each point where a component passes data to another component, and at each point where a component consumes data from another component. In terms of the component graph, every _edge_ in the graph will have two layers of instrumentation: one for the producing component and one for the consuming component. Importantly, each layer generates telemetry ascribed to a single component instance, so by having two layers per edge we can describe both sides of each handoff independently.
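To make this concrete, below is a minimal sketch (not the actual implementation) of how a logs edge could be wrapped; the package and type names are invented for illustration, and the same wrapping would apply to metrics and traces consumers. Two such wrappers per edge, one holding the producing component's attributes and one holding the consuming component's, give the two layers described above.

```go
package obsconsumer // hypothetical package name, for illustration only

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// logsLayer is one instrumentation layer on a graph edge. The graph builder
// would insert two of these per edge: one recording measurements ascribed to
// the producing component, one ascribed to the consuming component.
type logsLayer struct {
	next  consumer.Logs       // the next hop on the edge
	items metric.Int64Counter // an "items" counter for this layer
	attrs attribute.Set       // otel.component.kind, otel.component.id, otel.signal, ...
}

// Capabilities defers to the wrapped consumer so the wrapper is transparent.
func (l logsLayer) Capabilities() consumer.Capabilities { return l.next.Capabilities() }

// ConsumeLogs forwards the data and records telemetry for this layer's component.
func (l logsLayer) ConsumeLogs(ctx context.Context, ld plog.Logs) error {
	err := l.next.ConsumeLogs(ctx, ld)
	l.items.Add(ctx, int64(ld.LogRecordCount()), metric.WithAttributeSet(l.attrs))
	return err
}
```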

### Attributes
**Member:** Talk about the "scope" as well.

**Member Author (@djaglowski):** I've added language about the instrumentation scope.


All signals should use the following attributes:

#### Receivers

- `otel.component.kind`: `receiver`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ANY`**

#### Processors

- `otel.component.kind`: `processor`
- `otel.component.id`: The component ID
- `otel.pipeline.id`: The pipeline ID, **OR `ANY`**
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ANY`**

#### Exporters

- `otel.component.kind`: `exporter`
- `otel.component.id`: The component ID
- `otel.signal`: `logs`, `metrics`, `traces`, **OR `ANY`**

#### Connectors

- `otel.component.kind`: `connector`
- `otel.component.id`: The component ID
- `otel.signal`: `logs->logs`, `logs->metrics`, `logs->traces`, `metrics->logs`, `metrics->metrics`, etc, **OR `ANY`**

Note: The value `ANY` indicates that the recorded values are not associated with a particular signal or pipeline. This is used when a component enforces non-standard instancing patterns. For example, the `otlp` receiver is a singleton, so its values are aggregated across signals. Similarly, the `memory_limiter` processor is a singleton, so its values are aggregated across pipelines.
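For illustration only (these are not prescriptive examples from the RFC), the attribute sets for the two singleton cases above might look as follows; whether `otel.signal` is also aggregated for `memory_limiter` depends on its instancing, so it is shown as a specific signal purely for illustration:

```
otel.component.kind="receiver",  otel.component.id="otlp",           otel.signal="ANY"
otel.component.kind="processor", otel.component.id="memory_limiter", otel.pipeline.id="ANY", otel.signal="logs"
```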

### Metrics

There are two straightforward measurements that can be made on any pdata:

1. A count of "items" (spans, data points, or log records). These are low cost but broadly useful, so they should be enabled by default.
2. A measure of size, based on [ProtoMarshaler.Sizer()](https://github.com/open-telemetry/opentelemetry-collector/blob/9907ba50df0d5853c34d2962cf21da42e15a560d/pdata/ptrace/pb.go#L11). These are high cost to compute, so by default they should be disabled (and not calculated). A sketch of both measurements follows this list.
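As a rough sketch (assuming the pdata APIs behave as linked above; the helper names are made up for illustration), the two measurements could be obtained as follows. The size helpers are the ones that should be skipped entirely when the size metrics are disabled:

```go
package obsconsumer // hypothetical package name, for illustration only

import (
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/collector/pdata/pmetric"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

// Item counts are cheap: pdata already tracks them.
func logItems(ld plog.Logs) int          { return ld.LogRecordCount() }
func metricItems(md pmetric.Metrics) int { return md.DataPointCount() }
func traceItems(td ptrace.Traces) int    { return td.SpanCount() }

// Sizes require walking the proto representation, which is why the size
// metrics default to disabled and should not be computed unless enabled.
var (
	logsSizer    = &plog.ProtoMarshaler{}
	metricsSizer = &pmetric.ProtoMarshaler{}
	tracesSizer  = &ptrace.ProtoMarshaler{}
)

func logBytes(ld plog.Logs) int          { return logsSizer.LogsSize(ld) }
func metricBytes(md pmetric.Metrics) int { return metricsSizer.MetricsSize(md) }
func traceBytes(td ptrace.Traces) int    { return tracesSizer.TracesSize(td) }
```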

The location of these measurements can be described in terms of whether the data is "incoming" or "outgoing", from the perspective of the component to which the telemetry is ascribed.

1. Incoming measurements are attributed to the component which is _consuming_ the data.
2. Outgoing measurements are attributed to the component which is _producing_ the data.
**Member:** Challenge for this: do we really need both? If every component records this, the outgoing metrics of component X will be equal to the incoming metrics of the next component Y.

**Member (@andrzej-stencel, Oct 11, 2024):** I was also considering this. Are there situations where the outgoing metric of component A is not equal to the incoming metric of component B? I think this is the same information. Does splitting this into two different metrics help in usage?

I was considering a "one metric per graph edge" approach, as described in the below examples.

The advantage of the "one metric per edge" approach is that with sufficiently elaborate pipelines, the amount of metrics is cut in half (asymptotically).

Is there information that cannot be expressed with this "one metric per edge" approach? Are there other disadvantages, e.g. usability?

Example 1: Simple pipeline with Filter processor

```yaml
exporters:
  debug:
processors:
  filter:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'
receivers:
  filelog:
    include:
      - logs.json
service:
  pipelines:
    logs:
      exporters:
        - debug
      processors:
        - filter
      receivers:
        - filelog
```

Metrics according to this proposal:

```
otelcol_component_incoming_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 2
otelcol_component_outgoing_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 2
otelcol_component_incoming_items{otel.component.kind="processor",otel.component.id="filter",otel.signal="logs"} 2
otelcol_component_outgoing_items{otel.component.kind="processor",otel.component.id="filter",otel.signal="logs"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
```

One metric per graph edge:

```
otelcol_component[_incoming]_items{otel[.destination].component.kind="receiver",otel[.destination].component.id="filelog",otel.signal="logs"} 2
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="processor",otel.destination.component.id="filter",otel.signal="logs"} 2
otelcol_component_items{otel.source.component.kind="processor",otel.source.component.id="filter",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="logs"} 1
otelcol_component[_outgoing]_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
```
Example 2: Pipeline with connector

```yaml
connectors:
  otlpjson:
exporters:
  debug:
receivers:
  filelog:
    include:
      - otlp-log.json
      - otlp-metric.json
      - otlp-span.json
service:
  pipelines:
    logs/input:
      exporters: [otlpjson]
      receivers: [filelog]
    logs/otlp:
      exporters: [debug]
      receivers: [otlpjson]
    metrics/otlp:
      exporters: [debug]
      receivers: [otlpjson]
    traces/otlp:
      exporters: [debug]
      receivers: [otlpjson]
```

This proposal:

```
otelcol_component_incoming_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 3
otelcol_component_outgoing_items{otel.component.kind="receiver",otel.component.id="filelog",otel.signal="logs"} 3

otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->logs"} 3
otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->metrics"} 3
otelcol_component_incoming_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->traces"} 3
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->logs"} 1
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->metrics"} 1
otelcol_component_outgoing_items{otel.component.kind="connector",otel.component.id="otlpjson",otel.signal="logs->traces"} 1

otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="logs"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="metrics"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="metrics"} 1
otelcol_component_incoming_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="traces"} 1
otelcol_component_outgoing_items{otel.component.kind="exporter",otel.component.id="debug",otel.signal="traces"} 1
```

One metric per edge:

```
otelcol_component[_incoming]_items{otel[.destination].component.kind="receiver",otel[.destination].component.id="filelog",otel.signal="logs"} 3

otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="logs"} 3
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="metrics"} 3
otelcol_component_items{otel.source.component.kind="receiver",otel.source.component.id="filelog",otel.destination.component.kind="connector",otel.destination.component.id="otlpjson",otel.signal="traces"} 3

otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="logs"} 1
otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="metrics"} 1
otelcol_component_items{otel.source.component.kind="connector",otel.source.component.id="otlpjson",otel.destination.component.kind="exporter",otel.destination.component.id="debug",otel.signal="traces"} 1

otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="logs"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="metrics"} 1
otelcol_component[_outgoing]_items{otel[.source].component.kind="exporter",otel[.source].component.id="debug",otel.signal="traces"} 1
```

A small note that the OTLP/JSON connector's behavior is currently actually different, see open-telemetry/opentelemetry-collector-contrib#35738. I've described the metrics as they would look once that issue is resolved (the connector does not emit empty batches).

**Member Author (@djaglowski, Oct 11, 2024):**

> Do we really need both? If every component records this, the outgoing metrics of component X will be equal to the incoming metrics of the next component Y.

This is only true for linear pipelines. Any fanout or merge of data streams breaks this. e.g. Receiver A emits 10 items, receiver B emits 20, processor C consumes 30.

> The advantage of the "one metric per edge" approach is that with sufficiently elaborate pipelines, the amount of metrics is cut in half (asymptotically).

Philosophically, I think users should reasonably expect that they can observe an object in isolation to get a relatively complete picture of how it is behaving. Offloading half of the picture for the sake of reducing data volume is a premature optimization in my opinion. In any case, if we support in some way the ability to disable individual metrics, then users can just disable one or the other to opt into exactly this same optimization.

More concretely, say we measure only incoming values. Some things that immediately become difficult for users:

  1. How much data is my filter processor discarding? I can find the incoming value easily, but in order to answer the question I also have to figure out which component(s) it is sending data to, query for their incoming values, aggregate those values, then compare to the incoming. I'm sure this is easier in some backends than others but I don't think users will often write queries for this - they will just accept a lack of visibility. The query is even more difficult if you want it to be resilient to changes in the configuration.
  2. How much data is my receiver emitting? Same convoluted answer as above.
  3. In a linear pipeline, what is the net effect of all my processors? Same convoluted answer as above (vs just comparing outgoing from the receiver to incoming of the exporter).

The same problems exist if you only measure outgoing. The problems are even worse if we actually pin metrics to edges by including attributes from both components.


For both metrics, an `outcome` attribute with possible values `success` and `failure` should be automatically recorded, corresponding to whether or not the function call returned an error. Outgoing measurements will be recorded with `outcome` as `failure` when the next consumer returns an error, and `success` otherwise. Likewise, incoming measurements will be recorded with `outcome` as `failure` when the component itself returns an error, and `success` otherwise.
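As a minimal sketch (the helper names are invented for illustration and not part of the proposal), the outcome for a single logs handoff could be derived like this; the same function would serve the layer ascribed to the producing component (calling the next component's ConsumeLogs) and the layer ascribed to the consuming component (observing the described component's own ConsumeLogs return value):

```go
package obsconsumer // hypothetical package name, for illustration only

import (
	"context"

	"go.opentelemetry.io/collector/consumer"
	"go.opentelemetry.io/collector/pdata/plog"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// consumeAndRecord wraps one ConsumeLogs call and records an item count with
// the proposed outcome attribute derived from the returned error.
func consumeAndRecord(ctx context.Context, counter metric.Int64Counter, attrs []attribute.KeyValue, dest consumer.Logs, ld plog.Logs) error {
	err := dest.ConsumeLogs(ctx, ld)
	out := "success"
	if err != nil {
		out = "failure"
	}
	counter.Add(ctx, int64(ld.LogRecordCount()),
		metric.WithAttributes(append(attrs, attribute.String("outcome", out))...))
	return err
}
```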
**Member:** Which function call?

**Member Author (@djaglowski):** For incoming, it is the call from the previous component to the described component's ConsumeX function. For outgoing, it is the call from the described component to the next component's ConsumeX function. I'll clarify in the doc too.

**Member Author (@djaglowski):** I've added clarification in the doc.


```yaml
otelcol_component_incoming_items:
  enabled: true
  description: Number of items passed to the component.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_items:
  enabled: true
  description: Number of items emitted from the component.
  unit: "{items}"
  sum:
    value_type: int
    monotonic: true

otelcol_component_incoming_size:
  enabled: false
  description: Size of items passed to the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
otelcol_component_outgoing_size:
  enabled: false
  description: Size of items emitted from the component.
  unit: "By"
  sum:
    value_type: int
    monotonic: true
```

**Member:** I would call them "consumer", since a consumer is about moving data; components only start/stop.

**Member Author (@djaglowski):** `otelcol_component_consumed_items` and `otelcol_component_produced_items`?

**Member:**

> `otelcol_component_consumed_items` and `otelcol_component_produced_items`?

I like these names.

**Member:** There is an advantage to having the `otel.component.kind` in the name of the metric vs as a label:

- The rule of thumb for having a label vs being in the name is whether the sum across that label makes sense or not. I actually think it doesn't make that much sense here.
- The mechanism of disabling (you can still disable the calculation, but not the recording, since you don't know which labels will be recorded) and avoiding computation is at the "metric" level.

I think it may not be bad to have per-component-kind metrics. I would not necessarily go to per component kind and type level, but that can be another option.

**Member Author (@djaglowski):**

> The rule of thumb for having a label vs being in the name is whether the sum across that label makes sense or not. I actually think it doesn't make that much sense here.

I am on the fence about this. I agree the case is weak, but aggregation along this dimension might at least give a sense of scale.

What I mean is that if you consider each component as doing some non-trivial amount of work for an amount of data, you can aggregate to understand relatively how much work the collector is doing internally. e.g. Say you are considering some changes to a configuration and compare v(n) to v(n+1). If you aggregate across all dimensions of "incoming", you may see that even with the same inputs the collector has 10x the items passed between components.

Either way, I'm not necessarily against having metrics per kind. Some consequences though:

- To have consistent attributes with other signals, we should still apply the `otel.signal` attribute, even though this is redundant with the metric name.
- It's more difficult for users to answer questions like "which of the components is producing the most items", because they have to join across 3 metrics before sorting.

> The mechanism of disabling (you can still disable the calculation, but not the recording, since you don't know which labels will be recorded) and avoiding computation is at the "metric" level.

I'm not sure I understand. Are you pointing out that users would be forced to enable/disable the metric for all kinds, but maybe they should be able to control it for each kind independently? e.g. Only record items produced by receivers and consumed by exporters?

**Member:** This is something that triggers another question for me: control over whether a metric is enabled and recorded happens at creation of the SDK. We will need a way for components (for these metrics as well) to communicate this with the service telemetry.

**Member Author (@djaglowski):** I am not up to speed on the latest design, but wouldn't it be intuitive that `component.TelemetrySettings` is able to answer the question of whether or not a metric is enabled?

**Contributor:**

> Control over whether a metric is enabled and recorded happens at creation of the SDK. We will need a way for components (for these metrics as well) to communicate this with the service telemetry.

This makes me think that the components will need to be able to return views (we've talked about this a few times).

The alternative here would be to only support enabling/disabling of metrics via view configuration.

**Member (@bogdandrutu, Oct 13, 2024):**

> This makes me think that the components will need to be able to return views (we've talked about this a few times).

Yes, that would be great. I think it would be Factories though, since we need this before pipelines are created.

### Logs
**Contributor:** Would it be useful to capture details about logs that automatically report the status (i.e. started) of components in this RFC? I'm wondering if we should ensure that those logs include the attributes listed earlier in this document.

**Member Author (@djaglowski):** I'm not sure the mechanism I've described is able to do this. Also, since component status is still unstable, I'd prefer not to take that as a dependency here. My goal with this RFC is not to describe all standard telemetry for pipeline components, but to describe what can be done with the proposed mechanism.

**Member:** Agreed that this can be out of scope for now.

Metrics provide most of the observability we need, but there are some gaps which logs can fill. Although metrics describe the overall item counts, it is helpful in some cases to record more granular events. e.g. If an outgoing batch of 10,000 spans results in an error, but 100 batches of 100 spans succeed, this may be a matter of batch size that can be detected by analyzing logs, while the corresponding metric reports only a 50% success rate (by item count).

For security and performance reasons, it would not be appropriate to log the contents of telemetry.

It's very easy for logs to become too noisy. Even if errors occur frequently in the data pipeline, they may be of interest to many users only when they are not handled automatically.

With the above considerations, this proposal includes only the addition of a DEBUG log for each individual outcome. This should be sufficient for detailed troubleshooting but does not impact users otherwise.
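Purely as a hypothetical illustration (neither the message text nor the field names are defined by this RFC), such a DEBUG record for a failed outgoing handoff might carry the standard attributes plus the item count and error:

```
DEBUG  outgoing consume failed  {"otel.component.kind": "exporter", "otel.component.id": "debug", "otel.signal": "logs", "items": 10000, "error": "context deadline exceeded"}
```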

In the future, it may be helpful to define triggers for reporting repeated failures at a higher severity level, e.g. N failures in a row, or a moving average of the success rate. For now, the criteria and necessary configurability are unclear, so this is mentioned only as an example of future possibilities.

### Spans

It is not clear that any spans can be captured automatically with the proposed mechanism. We have the ability to insert instrumentation both before and after processors and connectors. However, we generally cannot assume a 1:1 relationship between incoming and outgoing data.

### Additional context

This proposal pulls from a number of issues and PRs:

- [Demonstrate graph-based metrics](https://github.com/open-telemetry/opentelemetry-collector/pull/11311)
- [Attributes for component instancing](https://github.com/open-telemetry/opentelemetry-collector/issues/11179)
- [Simple processor metrics](https://github.com/open-telemetry/opentelemetry-collector/issues/10708)
- [Component instancing is complicated](https://github.com/open-telemetry/opentelemetry-collector/issues/10534)