Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that changing the Span async from OnStart is undefined behavior #2751

Closed
wants to merge 2 commits into from

Conversation

bogdandrutu
Copy link
Member

The reason to make this undefined behavior is because it is not guarantee that the span is still active after the call to OnStart finishes,
so instead of doing a very complicated synchronization logic which none of the SDK does, mark this as undefined behavior.

Signed-off-by: Bogdan Drutu bogdandrutu@gmail.com

@bogdandrutu bogdandrutu requested review from a team August 25, 2022 14:24
The reason to make this undefined behavior is because it is not guarantee that the span is still active after the call to OnStart finishes,
so instead of doing a very complicated synchronization logic which none of the SDK does, mark this as undefined behavior.

Signed-off-by: Bogdan Drutu <bogdandrutu@gmail.com>
Co-authored-by: Yuri Shkuro <yurishkuro@users.noreply.github.com>
@arminru arminru added area:sdk Related to the SDK spec:trace Related to the specification/trace directory labels Aug 25, 2022
Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should weaken that requirement.

so instead of doing a very complicated synchronization logic which none of the SDK does, mark this as undefined behavior.

What would that synchronization logic be? The SDK needs to do some synchronization logic for spans anyway because they are thread safe, and it is valid to use a span after it ended. We specified that the write operations will be no-op though.

I originally wanted to object that processors could use OnEnd to detect when the span ends, but once OnEnd is invoked the span is already ended, so that would be too late. Of course one can also check if the span has ended (there is an explicit requirement that this is queryable from a readable span), but that is transient information as well.

Still, for some use cases a "works most of the time" will be enough, e.g. to regularly add a stacktrace in an event on the span.

I think we don't need to add any further requirements to the spec, this is already expressed through the text on End:

Implementations SHOULD ignore all subsequent calls to End and any other Span methods, i.e. the Span becomes non-recording by being ended (there might be exceptions when Tracer is streaming events and has no mutable state associated with the Span).

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#end

I think it does makes sense to add a note that after OnStart returns the span can be ended at any time, and this will happen before OnEnd is invoked.

There is one contradictory statement in the spec under https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#span-operations

With the exception of the function to retrieve the Span's SpanContext and recording status, none of the below may be called after the Span is finished.

The "may" is lowercased, and it seems inconsistent with the spec part for Span.End quoted above. We should change one of these locations, probably based on what SDKs allow today. If we want to keep the ban on calling span methods "after the Span is finished" it would be nice to clarify what this means in combination with thread safety, i.e. I think this would make End thread un-safe.

@bogdandrutu
Copy link
Member Author

You are making a not always true assumption that the same "instance" is returned in the OnStart as well as to the instrumentation user via StartSpan, in terms of synchronization etc. OnStart happens during StartSpan so the access is exclusive to the SDK, which can pass a non thread safe implementation to the OnStart.

The SDK needs to do some synchronization logic for spans anyway because they are thread safe, and it is valid to use a span after it ended. We specified that the write operations will be no-op though.

False; The operation happens during StartSpan and access is exclusive to the SDK, prior to the moment is return to the instrumentation call.

I originally wanted to object that processors could use OnEnd to detect when the span ends, but once OnEnd is invoked the span is already ended, so that would be too late. Of course one can also check if the span has ended (there is an explicit requirement that this is queryable from a readable span), but that is transient information as well.
Still, for some use cases a "works most of the time" will be enough, e.g. to regularly add a stacktrace in an event on the span.

I don't see these use-cases, and again you are requiring to return a thread safe implementation if you do this. Also any implementation that "works most of the time" is a bad implementation, so we should not encourage that.

The "may" is lowercased, and it seems inconsistent with the spec part for Span.End quoted above. We should change one of these locations, probably based on what SDKs allow today. If we want to keep the ban on calling span methods "after the Span is finished" it would be nice to clarify what this means in combination with thread safety, i.e. I think this would make End thread un-safe.

Sure, we should fix the inconsistency in the specs. Undefined Behavior is for the instance passed as argument toOnStart not for the instance returned by StartSpan, even though they are implementing the same interface does not mean they have same requirements.

@Oberon00
Copy link
Member

Oberon00 commented Aug 26, 2022

You are making a not always true assumption that the same "instance" is returned in the OnStart as well as to the instrumentation user via StartSpan, in terms of synchronization etc.

I am indeed making this assumption, and I think all SDKs will have to implement it that way or at least something very close to it, because we have the following in the spec (note especially the second paragraph) at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#additional-span-interfaces:

Read/write span: A function receiving this as argument must have access to both the full span API as defined in the API-level definition for span's interface and additionally must be able to retrieve all information that was added to the span (as with readable span).

It MUST be possible for functions being called with this to somehow obtain the same Span instance and type that the span creation API returned (or will return) to the user (for example, the Span could be one of the parameters passed to such a function, or a getter could be provided).


False; The operation happens during StartSpan and access is exclusive to the SDK, prior to the moment is return to the instrumentation call.

Only the read-access is specific to the SDK, and that we need to have thread safe together with concurrent write access anyways because of the valid read-only use cases that are allowed even after this PR.

I don't see these use-cases,

I think the example I mentioned, attaching an event with the stacktrace regularly to every span, is a valid one (e.g. every 2 minutes to detect long-running spans).

Also any implementation that "works most of the time" is a bad implementation, so we should not encourage that.

Sure, we can even explicitly discourage it, but IMHO we should not make currently allowed valid use cases impossible (technically this would even be a breaking change to a stable API)

@bogdandrutu
Copy link
Member Author

bogdandrutu commented Aug 26, 2022

Sure, we can even explicitly discourage it, but IMHO we should not make currently allowed valid use cases impossible (technically this would even be a breaking change to a stable API)

@Oberon00 how come is a breaking change? A UB does not mean code breaking, means undefined behavior, so the behavior if a user does that is not defined. Is it defined right now and we are changing that? I just proved that you defined it with an assumption in mind, which is not documented to be the case. So hence this behavior was not previously defined.

To understand where I am coming from, try to implement this using a non thread safe mechanism for write (not the way you have this implementation in mind), that also does not accept any more writes after the function ends? Is that a valid implementation that I have? It is, then my implementation will crash if users are doing what you want to do, so this is a definition of undefined behavior, if based on the current rules I can have 2 implementations one that crashes one that does not, users should be aware of that.

@Oberon00
Copy link
Member

Oberon00 commented Aug 29, 2022

I just proved that you defined it with an assumption in mind, which is not documented to be the case.

I just quoted the spec test that confirms my assumption:

It MUST be possible for functions being called with this [a read/write span] to somehow obtain the same Span instance and type that the span creation API returned (or will return) to the user (for example, the Span could be one of the parameters passed to such a function, or a getter could be provided).

-- https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#additional-span-interfaces

@Oberon00 Oberon00 dismissed their stale review August 30, 2022 12:32

I think it's technically a breaking change, but I don't actually feel this issue is very important either way, so I'm dismissing my request for changes.

@Oberon00
Copy link
Member

Oberon00 commented Aug 31, 2022

One follow-up comment: We should clarify (generally, maybe in the glossary) what we mean by "undefined behavior". My default assumption, coming from C++, is basically: "it's absolutely forbidden to go into this, even if it works now, it might cause security breaches, exceptions, or any other bad thing in the future".

@@ -491,6 +491,8 @@ exceptions.
SHOULD be reflected in it.
For example, this is useful for creating a SpanProcessor that periodically
evaluates/prints information about all active span from a background thread.
When a reference is kept, the span MUST be treated as a readonly object,
trying to modify the Span after `OnStart` method returns is undefined behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify if this also applies if one "somehow obtained the same Span instance and type that the span creation API returned", which is always possible according to the definition of read/write span.
I believe it cannot apply there (it could at most be unspecified if the span is already ended).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions
Copy link

github-actions bot commented Sep 8, 2022

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 8, 2022
@jsuereth jsuereth removed the Stale label Sep 8, 2022
@jsuereth
Copy link
Contributor

jsuereth commented Sep 8, 2022

@bogdandrutu Can you address @Oberon00's follow on comment?

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 16, 2022
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Sep 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sdk Related to the SDK spec:trace Related to the specification/trace directory Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants