Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924

Closed
wants to merge 13 commits into from

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Mar 6, 2024

Part of #1826.
Part of #1413.

Changes

The W3C Trace Context Level 2 specification adds a new Random flag intended for use with Probability sampling. The Sampling SIG has produced an OTEP describing how it will be used and how probability sampling can be encoded in the tracestate field. See https://github.com/open-telemetry/oteps/blob/main/text/trace/0235-sampling-threshold-in-trace-state.md

Prototype semantic conventions: open-telemetry/semantic-conventions#793
Prototype collector tail sampler: open-telemetry/opentelemetry-collector-contrib#24811
Prototype sampler logic: open-telemetry/opentelemetry-collector-contrib#29720

Not adding a new spec-compliance matrix for this, we already have "Conforms to the W3C TraceContext spec" however this clarifies what that means. Some SDKs are out of compliance with this specification, but (IMO) not intentionally, so we should not create a new row of the matrix.

@jmacd jmacd requested review from a team March 6, 2024 00:45
@jmacd
Copy link
Contributor Author

jmacd commented Mar 6, 2024

As an example of what this is hoping to clarify, see open-telemetry/opentelemetry-go#5030

- TraceState (string)

Propagators MUST NOT assume that bits 2-7 (6 most significant bits)
will be zero, as they are reserved for future use and are expected to
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The W3C TraceContext Level 2 spec says that these must be set to zero: https://www.w3.org/TR/trace-context-2/#other-flags. It will be good to clarify this here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conflicts with https://www.w3.org/TR/trace-context-2/#a-traceparent-is-received

The vendor will only parse the trace-flags values supported by this version of this specification and ignore all other values. If parsing fails, the vendor creates a new traceparent header and deletes the tracestate. Vendors will set all unparsed / unknown trace-flags to 0 on outgoing requests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe, then, my entire PR is invalid.

This makes me question whether W3C specification is for the best, because it will be many ages before the new flag becomes useful. Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense. If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Josh. I checked up now the W3C TraceContext Level 1 spec (which is already in recommendation state since 2021) and it has the same wording for the unknown flags (https://www.w3.org/TR/trace-context/#other-flags). Agree this makes things harder for mixed-mode scenarios (where a participant is on an older version that doesn't understand the newer flag)....Adding @dyladan , @SergeyKanzhelev , @basti1302 in case they have more context behind that recommendation or inputs on how to address this...

Copy link

@basti1302 basti1302 Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as 1 because it was 1 in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.

I'll try to explain this with a concrete example. Let's assume for the sake of the argument that the spec had started without the sampled flag, and we would introduce the sampled flag now in version 2. The sampled flag is effectively a fact about the actions of the specific participant that sends the header downstream, it is not a fact about the trace as a whole. Its semantic when sending a 1 downstream is "I have sampled this request".

Imagine the following scenario:

  • Service A (which already supports the newer version and understands the sampled flag) sends sampled=1 to service B.
  • Service B implements the older version of the spec and it does not know about the sampled flag.
  • Service B decides to not sample this request (ultimately, each participant can make its own sampling decision).
  • If the rule would be "propagate unknown flags as received", service B would propagate sampled=1 downstream, although it did not sample. So service B would effectively claim something that is not true.

I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.

  1. We have flags that represent a fact about the actions of the most recent participant (like sampled).
  2. There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.

Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible. When the current rule (set unknown flags to 0 when sending the header downstream) was put in place, there was only the sampled flag. Maybe the assumption back then was that other flags would also be of type (1.), although that is pure speculation on my part (I wasn't around back then).

Some other remarks:

Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense.

I disagree with this. "Vendor" might not be the best term (and we have replaced it in more recent versions of the spec with "participant" I think), but every piece of software interacting with the traceparent header must follow the spec if it claims to be compliant, we cannot arbitrarily claim "oh, I'm not a vendor, I'll just do something different".

If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.

I fully understand your frustration with this. Indeed, it would take a while to get this spec level implemented across the board, and it will be less useful if only a subset of SDKs have implemented it. Ultimately, backwards compatibility is a difficult and complex topic :-(

In my opinion, there is basically a tradeoff between speed of adoption of new flags and correctness of propagated flags, and the spec choose to value the latter over the former.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I very much disagree that this is a defect in level 1. :-)

Copy link

@kalyanaj kalyanaj Mar 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going by the criteria of mixed mode environments (which is the common case in a DT scenario). Ideally when a new capability (say a new flag) is introduced, we want whoever understands that newer capability to benefit from it. But in the current model, that's not possible, even if there's one older participant in the mix.

If it doesn't work in mixed mode environments, how do you see such new capabilities being adopted? Currently, the only way this can be used is for ALL the participants to upgrade to Level 2 of the spec - which may not be practical in environments with large number of participants.

Your earlier response on why the Level 1 spec might have modelled it that way (with the information available 3-4 years back) makes sense. And now we have the benefit of hindsight. I think what Josh/I are saying is the above f/w compat is a strong requirement, without which we cannot introduce new capabilities and expect adoption.

So, similar to what OTel is doing, maybe more bits are needed to express this and not just a single bit...

Happy to discuss more (either here and/or in the next W3C DT meeting).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as 1 because it was 1 in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.

But it feels even worse when a participant changes a flag from 1 to 0, without even knowing what it means.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.

  1. We have flags that represent a fact about the actions of the most recent participant (like sampled).
  2. There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.

Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible.

Indeed, so perhaps just to open a path for moving forward we can consider the following. We have 8 bits - how about splitting them into the above categories today, let's say 4+4 - before these bits are defined. This would still be a breaking change for the specs, of course.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand @jmacd's argument correctly, it is that when the random bit is 0, it is ambiguous if this means "not random" or if this means "old version of spec." The same ambiguity does not exist for the 1 state because the spec requires setting unknown flags to 0. Adding a second bit set to 1 would clarify that this is indeed generated by a level-2 generator but the random bit was explicitly not set. We discussed at one point adding a second bit for this purpose, but decided that the current situation where 0 is ambiguous is ok because "not random" and "maybe random" should be treated the same way.

I agree with @basti1302 that propagating unknown claims is potentially hazardous. I don't have anything to add to his arguments above.

specification/context/api-propagators.md Outdated Show resolved Hide resolved
specification/trace/api.md Outdated Show resolved Hide resolved
specification/trace/api.md Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Mar 15, 2024
@jmacd jmacd removed the Stale label Mar 18, 2024
@jmacd jmacd requested a review from a team March 20, 2024 15:52
@jmacd
Copy link
Contributor Author

jmacd commented Mar 20, 2024

@kalyanaj I've updated this PR to reflect the reality of the W3C trace context specification regarding unknown flags. The PR itself is much less important, as a result, but still worth applying.

@dyladan dyladan self-requested a review March 26, 2024 12:50
specification/trace/sdk.md Outdated Show resolved Hide resolved
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 13, 2024
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
@jmacd jmacd changed the title Clarify how Trace Flags and Tracestate are handled by Trace SDKs Update Trace specification to use W3C Trace Context Level 2; add Random flag Apr 16, 2024
@jmacd jmacd closed this Apr 16, 2024
@jmacd
Copy link
Contributor Author

jmacd commented Apr 16, 2024

I may not re-open this; it serves very little purpose in its current form. It shows we need work on #1929.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants