Adding a means to denote preference for Link vs Span in given Context #4054

hibachrach · 2024-05-19T07:15:18Z

First time contributor here--please let me know if there's anything I need to correct procedurally!

What are you trying to achieve?

I'm hoping to extend the API description for Context as it pertains to Traces (or perhaps change the API description for Traces?)

Here's the problem I'm trying to solve:

Consider a user flow in which a file conversion service accepts an HTTP request. The user's intent is to upload a file to convert it from one file type to another (e.g. .pdf -> .png). For infrastructure performance reasons, this work is enqueued in a messaging system/background job processing system (e.g. Sidekiq in the Ruby ecosystem). As part of handling the same request, a similar message/background job is sent to the same system. This flow is depicted below:

graph TD
    A[Client] -->|Requests| B[Request Handler]
    B -->|Enqueues| C[Background Job for File Conversion]
    B -.->|Enqueues| D[Background Job for Analytics]
    style D stroke-dasharray: 5 5

From a user's perspective, only the solid-line nodes in the above graph are important. The dashed lines are merely an implementation detail. Similarly, only the Spans related to the solid-line nodes are, to me as the engineer monitoring this system, significant. If the analytics-related message is not complete for another hour, that does not matter to me very much.

This maps neatly to the notion of child Spans & Links: for a step in a flow where it is all really part of one high-level request, one would hope to have that represented as a child Span. It is truly just part of the overall Trace (or, at least, what I would expect a Trace to capture). On the other hand, the analytics message is relatively detached. While a Link could point me to its source, it is not something that is part of the user journey.

What did you expect to see?

Ideally, this is something that you'd be able to hint at via Context. Here's what the above would look like via pseudocode:

function requestHandler():
  Context.current.setPropagationStyleHintTo('Child'):
    BackgroundJobSystem.enqueue(FileConversionTask)
  BackgroundJobSystem.enqueue(AnalyticsTask)

This way, the Otel instrumentation for BackgroundJobSystem would know how to relate the next Span it creates to the current Span. This hint would reset within further nested spans (e.g. any tasks that FileConversionTask may enqueue would, by default, be connected via Links.

Some open questions that I don't have the answer to just yet:

What makes sense as the default propagation style? My guess is this would be defined by the instrumentation library. E.g. Sidekiq's uses Links by default.
~~Should an instrumentation library be permitted to disregard this hint?~~
- I would think yes: in the Sidekiq example, there would necessarily be the creation of both the PRODUCER and CONSUMER Spans and you'd want it to apply to both. So the instrumentation library needs to be given some flexibility in interpretation. Also, making it a requirement would necessarily be a breaking change I would think?

Additional context.

I was directed here from open-telemetry/opentelemetry-ruby-contrib#991 as a contributor thought this may make sense more generally.

The text was updated successfully, but these errors were encountered:

dmathieu · 2024-05-21T07:23:51Z

Something similar is already being done by instrumentation libraries.
For example, the WithPublicEndpoint option in Go:
https://github.com/open-telemetry/opentelemetry-go-contrib/blob/a91e60beace7c1d96302fe359050d705c4723e72/instrumentation/net/http/otelhttp/handler.go#L140-L146

Ruby has this proposal open for the Rack instrumentation: open-telemetry/opentelemetry-ruby-contrib#837

It seems your pseudo-code proposal would be the same as:

link = OpenTelemetry::Trace::Link.new(OpenTelemetry::Trace.current_span.context)
span = tracer.start_root_span(
  "FileConversion",
  with_parent: OpenTelemetry::Context.empty,
  links: [link]
)

It's a bit more verbose, but nothing a simple helper method can't resolve. Does it really need a spec change?

hibachrach · 2024-05-21T15:42:45Z

Hmm, perhaps I'm misunderstanding, but I don't think your code snippet isn't equivalent to what I'm describing. Your code allows one to sever the parent-child relationship between Spans that is present by default, but it doesn't allow for the opposite (converting what would be a Link by default into a parent-child relationship).

trask · 2024-05-21T15:45:01Z

cc @open-telemetry/semconv-messaging-approvers

pyohannes · 2024-05-22T08:33:16Z

@hibachrach If I understand correctly, you enqueue messages (or background jobs), and at enqueue time you want to specify whether the span for dequeuing the message (or running the background job) is created as a child or a link of the current span.

This effectively requires two parts:

A standardized way for propagating this decision or hint. For W3C, this might be part of the tracestate header.
On instrumentation side, producers need to populate this hint in the outgoing context, consumers need to extract and honor it.

This is definitely a valid use case, however, I'm wondering whether the need is widespread enough to merit a specification change and implementations in all languages. This issue is a great place to collect use cases and gauge interest.

Currently and without this capability, this needs to be solved on instrumentation side. In your case, the (sidekiq) instrumentation library could provide a configuration option to specify for which jobs you want to have parent/child relationships and for which you want to have links. I know of some instrumentation libraries (for example AWS Lambda for .NET) where such an option exists on a global level and allows switching between parent/child and link relationships.

lmolkova · 2024-05-22T18:33:11Z

Both static and context-based configuration for instrumentation are problematic
- Static configuration is not enough since some jobs are parented, others are linked.
- Context-based - there could be other spans nested under Context.current.setPropagationStyleHintTo('Child') which would be affected whether application author wants it or not

This sounds like a visualization concern rather than instrumentation/data collectopm issue.

@hibachrach could you please elaborate on

From a user's perspective, only the solid-line nodes in the above graph are important. The dashed lines are merely an implementation detail. Similarly, only the Spans related to the solid-line nodes are, to me as the engineer monitoring this system, significant. If the analytics-related message is not complete for another hour, that does not matter to me very much.

Could you please help me understand why is it important to use links (and not parent-child) for some of the background jobs:

is it a visual noise problem?
is it related to billing?
does it skew some additional analysis like e2e latency?
anything else?

hibachrach · 2024-05-23T04:25:42Z

@pyohannes

Currently and without this capability, this needs to be solved on instrumentation side. In your case, the (sidekiq) instrumentation library could provide a configuration option to specify for which jobs you want to have parent/child relationships and for which you want to have links

Yes, this was my first thought as well--see open-telemetry/opentelemetry-ruby-contrib#991

@lmolkova

Both static and context-based configuration for instrumentation are problematic

Static configuration is not enough since some jobs are parented, others are linked.

Context-based - there could be other spans nested under Context.current.setPropagationStyleHintTo('Child') which would be affected whether application author wants it or not

So context based solution would not reset it for the whole subtree, but merely for the immediate child. At the discretion of the instrumentation library author for that child node, it could continue propagating. Otherwise, it would reset.

This sounds like a visualization concern rather than instrumentation/data collectopm issue.

[...]

Could you please help me understand why is it important to use links (and not parent-child) for some of the background jobs:

is it a visual noise problem?

is it related to billing?

does it skew some additional analysis like e2e latency?

anything else?

I would say it's all of those items you listed:

It does add a lot of noise to a trace, though I know that wouldn't be a big issue with our tracing SaaS product of choice (Honeycomb)
AFAIK, there's not a way generally to ensure both ends of a link get chosen if using sampling. This effectively makes links useless for us. This is the primary reason I think this makes sense to tackle at the instrumentation/collection level.
As of today, we aren't able to do analysis on total trace duration (merely span duration) but my hope is that we'd be able to in the future. And yes, when that day comes, having "extra" subtrees in the trace would hamper that analysis.

dmathieu · 2024-05-23T07:06:30Z

So context based solution would not reset it for the whole subtree, but merely for the immediate child. At the discretion of the instrumentation library author for that child node, it could continue propagating. Otherwise, it would reset.

What if you want the setting to keep propagating, but the instrumentation author decided otherwise?

Yes, this was my first thought as well--see open-telemetry/opentelemetry-ruby-contrib#991

This issue is very relevant, and maybe it does entitle a specification change. But does it require an API change?
A specification change could be something that asks (official) instrumentation authors that their libraries may provide an option to propagate the parent span as a link rather than the default parent/child relationship.

cc @arielvalentin

lmolkova · 2024-05-23T17:51:33Z

thanks for the details @hibachrach

I do agree that all of these problems are good ones to solve, but I wonder if the proposed solution actually solves them in general case.

First, application need to customize parent/links relationships for specific spans in the system. It would improve the situation, in some case, but

visual noise can happen with parents or links
inconsistent sampling for links would force people to pick parent-child relationship and would increase the noise and skew e2e analysis
e2e latency would be a problem either way since only the specific application knows where the end is and which operations are fire-and-forget.

I'd suggest to tackle each of the problems individually and in a generic way (ask backends to improve links support/visualization, improve sampling with links, come up with ways to identify how to measure e2e latency, etc).

One more thing:

In many cases having links is not a choice, but a necessity - e.g. when pulling message from a queue, you don't know the context of the message(s) you're about to pull and can't use parent-child relationship. For the systems that don't have this problem, e.g. in-memory queues, parent-child may be a better default.

hibachrach added the spec:trace Related to the specification/trace directory label May 19, 2024

dyladan added the triage:deciding:community-feedback label May 21, 2024

lmolkova mentioned this issue Jul 9, 2024

Semantic convention for span link names open-telemetry/semantic-conventions#1057

Open

github-actions bot added the triage:followup label Sep 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding a means to denote preference for Link vs Span in given Context #4054

Adding a means to denote preference for Link vs Span in given Context #4054

hibachrach commented May 19, 2024 •

edited

Loading

dmathieu commented May 21, 2024 •

edited

Loading

hibachrach commented May 21, 2024

trask commented May 21, 2024

pyohannes commented May 22, 2024

lmolkova commented May 22, 2024 •

edited

Loading

hibachrach commented May 23, 2024

dmathieu commented May 23, 2024

lmolkova commented May 23, 2024 •

edited

Loading

Adding a means to denote preference for Link vs Span in given Context #4054

Adding a means to denote preference for Link vs Span in given Context #4054

Comments

hibachrach commented May 19, 2024 • edited Loading

dmathieu commented May 21, 2024 • edited Loading

hibachrach commented May 21, 2024

trask commented May 21, 2024

pyohannes commented May 22, 2024

lmolkova commented May 22, 2024 • edited Loading

hibachrach commented May 23, 2024

dmathieu commented May 23, 2024

lmolkova commented May 23, 2024 • edited Loading

hibachrach commented May 19, 2024 •

edited

Loading

dmathieu commented May 21, 2024 •

edited

Loading

lmolkova commented May 22, 2024 •

edited

Loading

lmolkova commented May 23, 2024 •

edited

Loading