Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924

Closed
wants to merge 13 commits into from
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,10 @@ release.

### Traces

- Specify how span flags, trace context flags, and the tracestate
field are propagated and exported by trace
SDKs. ([#3924](https://github.com/open-telemetry/opentelemetry-specification/pull/3924))

### Metrics

- Remove implementation specific specification from metric API.
Expand Down
18 changes: 18 additions & 0 deletions specification/context/api-propagators.md
Original file line number Diff line number Diff line change
Expand Up @@ -355,6 +355,24 @@ Additional `Propagator`s implementing vendor-specific protocols such as AWS
X-Ray trace header protocol MUST NOT be maintained or distributed as part of
the Core OpenTelemetry repositories.

### W3C Trace Context Requirements

A W3C Trace Context propagator is expected to implement the
`traceparent` and `tracestate` contexts fields specified in [W3C Trace
Context Level 2](https://www.w3.org/TR/trace-context-2/).

When injecting and extracting trace context to or from a carrier, the
following fields are propagated.

- TraceID (16 bytes)
- SpanID (8 bytes)
- TraceFlags (8 bits)
- TraceState (string)

Propagators MUST NOT assume that bits 2-7 (6 most significant bits)
jmacd marked this conversation as resolved.
Show resolved Hide resolved
will be zero, as they are reserved for future use and are expected to
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The W3C TraceContext Level 2 spec says that these must be set to zero: https://www.w3.org/TR/trace-context-2/#other-flags. It will be good to clarify this here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conflicts with https://www.w3.org/TR/trace-context-2/#a-traceparent-is-received

The vendor will only parse the trace-flags values supported by this version of this specification and ignore all other values. If parsing fails, the vendor creates a new traceparent header and deletes the tracestate. Vendors will set all unparsed / unknown trace-flags to 0 on outgoing requests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe, then, my entire PR is invalid.

This makes me question whether W3C specification is for the best, because it will be many ages before the new flag becomes useful. Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense. If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Josh. I checked up now the W3C TraceContext Level 1 spec (which is already in recommendation state since 2021) and it has the same wording for the unknown flags (https://www.w3.org/TR/trace-context/#other-flags). Agree this makes things harder for mixed-mode scenarios (where a participant is on an older version that doesn't understand the newer flag)....Adding @dyladan , @SergeyKanzhelev , @basti1302 in case they have more context behind that recommendation or inputs on how to address this...

Copy link

@basti1302 basti1302 Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as 1 because it was 1 in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.

I'll try to explain this with a concrete example. Let's assume for the sake of the argument that the spec had started without the sampled flag, and we would introduce the sampled flag now in version 2. The sampled flag is effectively a fact about the actions of the specific participant that sends the header downstream, it is not a fact about the trace as a whole. Its semantic when sending a 1 downstream is "I have sampled this request".

Imagine the following scenario:

  • Service A (which already supports the newer version and understands the sampled flag) sends sampled=1 to service B.
  • Service B implements the older version of the spec and it does not know about the sampled flag.
  • Service B decides to not sample this request (ultimately, each participant can make its own sampling decision).
  • If the rule would be "propagate unknown flags as received", service B would propagate sampled=1 downstream, although it did not sample. So service B would effectively claim something that is not true.

I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.

  1. We have flags that represent a fact about the actions of the most recent participant (like sampled).
  2. There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.

Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible. When the current rule (set unknown flags to 0 when sending the header downstream) was put in place, there was only the sampled flag. Maybe the assumption back then was that other flags would also be of type (1.), although that is pure speculation on my part (I wasn't around back then).

Some other remarks:

Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense.

I disagree with this. "Vendor" might not be the best term (and we have replaced it in more recent versions of the spec with "participant" I think), but every piece of software interacting with the traceparent header must follow the spec if it claims to be compliant, we cannot arbitrarily claim "oh, I'm not a vendor, I'll just do something different".

If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.

I fully understand your frustration with this. Indeed, it would take a while to get this spec level implemented across the board, and it will be less useful if only a subset of SDKs have implemented it. Ultimately, backwards compatibility is a difficult and complex topic :-(

In my opinion, there is basically a tradeoff between speed of adoption of new flags and correctness of propagated flags, and the spec choose to value the latter over the former.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I very much disagree that this is a defect in level 1. :-)

Copy link

@kalyanaj kalyanaj Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going by the criteria of mixed mode environments (which is the common case in a DT scenario). Ideally when a new capability (say a new flag) is introduced, we want whoever understands that newer capability to benefit from it. But in the current model, that's not possible, even if there's one older participant in the mix.

If it doesn't work in mixed mode environments, how do you see such new capabilities being adopted? Currently, the only way this can be used is for ALL the participants to upgrade to Level 2 of the spec - which may not be practical in environments with large number of participants.

Your earlier response on why the Level 1 spec might have modelled it that way (with the information available 3-4 years back) makes sense. And now we have the benefit of hindsight. I think what Josh/I are saying is the above f/w compat is a strong requirement, without which we cannot introduce new capabilities and expect adoption.

So, similar to what OTel is doing, maybe more bits are needed to express this and not just a single bit...

Happy to discuss more (either here and/or in the next W3C DT meeting).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as 1 because it was 1 in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.

But it feels even worse when a participant changes a flag from 1 to 0, without even knowing what it means.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.

  1. We have flags that represent a fact about the actions of the most recent participant (like sampled).
  2. There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.

Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible.

Indeed, so perhaps just to open a path for moving forward we can consider the following. We have 8 bits - how about splitting them into the above categories today, let's say 4+4 - before these bits are defined. This would still be a breaking change for the specs, of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand @jmacd's argument correctly, it is that when the random bit is 0, it is ambiguous if this means "not random" or if this means "old version of spec." The same ambiguity does not exist for the 1 state because the spec requires setting unknown flags to 0. Adding a second bit set to 1 would clarify that this is indeed generated by a level-2 generator but the random bit was explicitly not set. We discussed at one point adding a second bit for this purpose, but decided that the current situation where 0 is ambiguous is ok because "not random" and "maybe random" should be treated the same way.

I agree with @basti1302 that propagating unknown claims is potentially hazardous. I don't have anything to add to his arguments above.

propagate with the context.

### B3 Requirements

B3 has both single and multi-header encodings. It also has semantics that do not
Expand Down
5 changes: 4 additions & 1 deletion specification/trace/api.md
Original file line number Diff line number Diff line change
Expand Up @@ -221,7 +221,10 @@ byte.

`TraceFlags` contain details about the trace. Unlike TraceState values,
TraceFlags are present in all traces. The current version of the specification
jmacd marked this conversation as resolved.
Show resolved Hide resolved
only supports a single flag called [sampled](https://www.w3.org/TR/trace-context/#sampled-flag).
supports two flags:

- [Sampled](https://www.w3.org/TR/trace-context/#sampled-flag)
- [Random](https://www.w3.org/TR/trace-context-2/#random-trace-id-flag)

`TraceState` carries vendor-specific trace identification data, represented as a list
of key-value pairs. TraceState allows multiple tracing
jmacd marked this conversation as resolved.
Show resolved Hide resolved
Expand Down
27 changes: 27 additions & 0 deletions specification/trace/sdk.md
Original file line number Diff line number Diff line change
Expand Up @@ -173,6 +173,27 @@ Thus, the SDK specification defines sets of possible requirements for
(for example, the `Span` could be one of the parameters passed to such a function,
or a getter could be provided).

### Span flags

The OTLP representation for Span and Span Link include a 32-bit field
declared as Span Flags.

Bits 0-7 of the Span Flags field are reserved for the 8 bits of Trace
Context flags, specified in [W3C Trace
Context](https://www.w3.org/TR/trace-context-2/). [See the list of
recognized flags](./api.md#spancontext).

Bits 8 and 9 are defined to report the Remote property associated with
jmacd marked this conversation as resolved.
Show resolved Hide resolved
the SpanContext `IsRemote` property. SDKs should report this
information as follows:

- IsRemote = `true`: Bits 8 and 9 are set in the flags (i.e., `0x300`).
- IsRemote = `false`: Bits 8 is set in the flags (i.e., `0x100`).
jmacd marked this conversation as resolved.
Show resolved Hide resolved

For example, if the Span's incoming context has flags 0x3 (indicating
`Sampled` and `Random`) and the parent SpanContext `IsRemote`, the
jmacd marked this conversation as resolved.
Show resolved Hide resolved
resulting Span Flags will equal `0x303`.

## Sampling

Sampling is a mechanism to control the noise and overhead introduced by
Expand Down Expand Up @@ -295,6 +316,12 @@ be displayed on debug pages or in the logs. Example:

Description MUST NOT change over time and caller can cache the returned value.

### Span TraceState

The tracestate associated returned by the Sampler SHOULD be recorded
jmacd marked this conversation as resolved.
Show resolved Hide resolved
in the span's readable span data object and set in the corresponding field
when exported using OpenTelemetry's OTLP format.
jmacd marked this conversation as resolved.
Show resolved Hide resolved

### Built-in samplers

OpenTelemetry supports a number of built-in samplers to choose from.
Expand Down
Loading