Skip to content
This repository has been archived by the owner on Aug 14, 2024. It is now read-only.

Add spec for Dynamic Sampling Context #613

Merged
merged 28 commits into from
Jun 24, 2022

Conversation

lforst
Copy link
Member

@lforst lforst commented Jun 22, 2022

Todo

Resolves #611

After merging this PR:

@vercel
Copy link

vercel bot commented Jun 22, 2022

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated
develop ✅ Ready (Inspect) Visit Preview Jun 24, 2022 at 8:15PM (UTC)

Comment on lines 13 to 14
Until now, traces sampling was only done through a `sample_rate` option in the SDKs.
This has quite a few drawbacks for users of Sentry SDKs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't tracesSampleRate? if it's for traces.
sampleRate is only for events.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sir you are correct. Thanks for pointing this out. Fixed in 6cdc569.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I'd look up the usage of this name in the docs tho, there are more references to sample_rate or sampleRate, such as sentry-samplerate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I changed it to camel case and added links to the relevant sections in the docs (where it is also in camel case): 1a27b85

If a `baggage` header already exists on an outgoing request, SDKs should aim to be good citizens by only **appending** Sentry values to the header.
In the case that another vendor added Sentry values to an outgoing request, SDKs may overwrite those values.

SDKs must not add other vendors' baggage from incoming requests to outgoing requests.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So devs have to add incoming baggage headers to outgoing requests themselves. If it's already there we just add our list-items to the outgoing baggage header(s). If no baggage header is present we add one with our list-items if our options tell us to do so, correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes correct, we already discussed this quite extensively within the JS SDK team. If users want to propagate baggage in general, they can propagate it themselves (or use libraries specifically for that). Sentry SDKs should only propagate sentry-* entries in the baggage header.

Mental model: Sentry SDKs are not trace-context/baggage propagation libraries - they are Sentry SDKs.

If anybody has strong opinions against this however, we can reopen the discussion on this.

- `sentry-transaction` - The name of the trace's origin transaction in unparameterized (raw) format
- `sentry-userid` - User ID as set by the user with <Link to="/sdk/unified-api/#scope">`scope.set_user`</Link>
- `sentry-usersegment` - User segment as set by the user with <Link to="/sdk/unified-api/#scope">`scope.set_user`</Link>
- `sentry-samplerate` - Sample rate as defined by the user in the SDK options
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should the sample rate be formatted? Do we have a max number of decimal points?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, in the JS SDK, we're currently just calling toString on the sample rate number which caps it per default at 16 decimal points. Happy to change it if we decide to handle this uniformly across SDKs

Copy link
Member

@Lms24 Lms24 Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a little more thought and as far as I can tell, the head SDK should set and propagate/send a format that in the end can be parsed by Relay without problems. I see no reason why downstream SDKs would have to convert a received sample rate string to a number, so we can probably disregard language specific concerns other than Rust.

Which brings me to the question what the constrains on the Relay side are (@jjbayer, any thoughts?). Should we e.g. agree on only sending/propagating the sample rate in "simple" decimal notation (i.e. no e.g. exponential notation such as 1.45e10-14) as proposed by @lforst ?

EDIT: @lforst just notified me that this was already discussed and we agreed on the proposal


Todo:

- Why baggage and not trace context https://www.w3.org/TR/trace-context/?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trace context is more restrictive in terms of size than baggage. Also baggage is more flexible in terms of encoding and characters according to our internal document on the decision.

### Envelope Header

Dynamic Sampling Context is transferred to Sentry through the <Link to="/sdk/envelopes/#transaction">transaction envelope headers</Link>, keyed by `trace`.
It corresponds directly to the definition of <Link to="/sdk/performance/trace-context/#trace-context">Trace Context</Link>.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the urgency on renaming TraceContext to Dynamic Sampling Context for SDKs?

Copy link
Member Author

@lforst lforst Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think those are still more or less different concepts. The word Dynamic Sampling Context represents the sentry-* values we propagate in the baggage header. The term TraceContext is incredibly overloaded - in the docs I just wrote it simply refers to the object schema over at https://develop.sentry.dev/sdk/performance/trace-context/#trace-context

I know in SDKs there are some fields called for exampleevent.contexts.trace but those fields are something else again. Right now we only care about Dynamic Sampling Context propagation in baggage and TraceContext in the event envelope header.

Edit: Lukas provided a better answer down below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was referring to this comment #611 (comment) and comments following it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to jump in here but what we proposed was to find a good term for the data we want to both, propagate to downstream SDKs (i.e. via the baggage Http header) and send to Relay (via the trace envelope header). That data should be the same, however, it is structured differently (as pointed out in the docs added by this PR).

The reasons for introducing "Dynamic Sampling Context" (DSC) were already mentioned: Trace Context is an overloaded term, we observed a lot of confusion around terms like "baggage", "trace state", "trace context", etc. and DSC should aim to unify this as much as possible. We're aware that Relay is currently calling the trace envelope header TraceContext. IMO it's not as important to change this right now but to rather have a term we can all agree on when talking about what we actually propagate/send.

Discussed this with @lforst and we agree on that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So DSC is what we use to internally to transport tracing information from an incoming request that may exist or the place it where the tracing information is created in the SDK (that's first in line) to the outgoing request (be that an API call or an envelope sent to Sentry).

Would it make sense to find a name for the "first in line" SDK to more easily refer to it? e.g. head SDK?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We propose that DSC is a term we use to describe the bag of key-value pairs which are propagated (baggage) and sent to relay (envelope header). So DSC contains:

  • the three "internal" items (trace id, public key, sample rate)
  • the five "external" items (environment, release, transaction name, user id, user segment)

So essentially, see it as a "meta interface" describing the stuff we propagate/send.

Maybe I misunderstood the question but it doesn't really have anything to do with the head transaction. Meaning, an SDK would get DSC via an incoming baggage header (if there exists one/the SDK is not the head SDK)

Would it make sense to find a name for the "first in line" SDK to more easily refer to it? e.g. head SDK?

Yes, very much in favour of "head SDK", "head transaction", etc. (as in "head" of trace)

<Alert level="info">

Other vendors might also be using the `baggage` header.
If a `baggage` header already exists on an outgoing request, SDKs should aim to be good citizens by only **appending** Sentry values to the header.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For incoming baggage that already contains sentry- list-items:

  • We do not modify existing ones as we want to keep them the way they were sent by the first SDK in line, right?
  • Does that also mean we do not add missing list-items? Assuming userid was not set by the first SDK but another SDK knows it, should it add the list-item to baggage and other places or keep them the way they were sent by the first SDK?

Copy link
Member Author

@lforst lforst Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Dynamic Sampling Context is immutable across an entire trace, we cannot add additional sentry- list items when there already are ones on an incoming request. The reasoning for immutability is explained in the #ingest section of this doc - essentially relay needs to make exactly the same sampling decision for all individual transactions of a trace. This requires all transactions to have the exact same Dynamic Sampling Context.

I added the pseudo algorithm of how SDKs should instrument incoming and outgoing requests in regards to DSC. I hope that clarifies this a bit: 3f13024

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added some sentences to explain this a little bit more explicitly: ec30942

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s also make it very clear that we should not delete or change existing non-sentry baggage values (and this this very very important).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AbhiPrasad

we should not delete or change existing non-sentry baggage values

Does this mean that if an incoming baggage header is close to the limit of 8192 characters we are not allowed to add our sentry list-items to the header? Does that mean we should add them in an all or nothing fashion or do we have a priority for which of them we want to try and add until we run out of characters?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that if an incoming baggage header is close to the limit of 8192 characters we are not allowed to add our sentry list-items to the header?

Yes we need to be good citizens and respect the standard here.

Does that mean we should add them in an all or nothing fashion or do we have a priority for which of them we want to try and add until we run out of characters?

I don't think we have a priority, we try to add what we can. I'd stay away from the all-or-nothing approach since that seems it might be confusing to users (not sure though). I'm comfortable enough with this for now since the vast majority of head SDK's (those creating the head transactions of a trace) will be from browser/mobile, which should not have problems with incoming baggage - only outgoing. As a result, we won't really hit this to start, and when if it ever becomes a substantial problem, I think we can come back and revisit. For now, let's just optimize for getting this out the door.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we have a priority, we try to add what we can

I agree - let's try to keep things simple for now. In case this really becomes a problem, we could still revisit this and discuss priorities of keys or other handling strategies.

For now, if we exceed the max length, we should log a warning though so that users/we are aware of why DSC might be propagated/sent incomplete in this case.
(Which is what we're currently doing in the JS SDKs).

Co-authored-by: Alexander Dinauer <adinauer@users.noreply.github.com>
# --> we don't propagate baggage for this trace
transaction.baggage_locked = true
transaction.baggage = {}
elif has_header(request, "baggage") and has_sentry_value_in_baggage_header(request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does has_sentry_value_in_baggage_header work?

Is there a specific list of list-items we need to be present for dynamic sampling to work?
I assume samplerate and publickey are required, are other fields as well?
Some of them are optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does has_sentry_value_in_baggage_header work?

Added a pseudo implementation of the function to clarify: 0513094 - We simply check if there is a key in the baggage header that starts with "sentry-" or not.

Is there a specific list of list-items we need to be present for dynamic sampling to work?

There is no list yet but tried to clarify in 4a7508f. I believe for DS we need samplerate, publickey and traceid. However I should note that right here we don't care at all about what values are actually needed. If something is missing in an incoming request, we propagate DSC as is, and don't try to add stuff. I again wanna put emphasis on the fact that DSC cannot be mutated (by the origin application or any application down the line) as soon it has been propagated.

Copy link
Member

@AbhiPrasad AbhiPrasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start, this is awesome @lforst!

src/docs/sdk/performance/dynamic-sampling-context.mdx Outdated Show resolved Hide resolved
This has quite a few drawbacks for users of Sentry SDKs:

- Changing the sampling rate involved either redeploying applications (which is problematic in case of applications that are not updated automatically, i.e., mobile apps or physically distributed software) or building complex systems to dynamically fetch a sampling rate.
- Sampling only happened based on a factor of randomness.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sampling is happening based on head based, probability sampling (in this case, simple random sampling)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a suggestion on how to reword this? I am kinda lost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me think about this and try to push up a commit later today.

src/docs/sdk/performance/dynamic-sampling-context.mdx Outdated Show resolved Hide resolved
<Alert level="info">

Other vendors might also be using the `baggage` header.
If a `baggage` header already exists on an outgoing request, SDKs should aim to be good citizens by only **appending** Sentry values to the header.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let’s also make it very clear that we should not delete or change existing non-sentry baggage values (and this this very very important).


This section details some open questions and considerations that need to be addressed for dynamic sampling and the usage of the baggage propogation mechanism. These are not blockers to the adoption of the spec, but instead are here as context for future developments of the dynamic sampling product and spec.

### The Temporal Problem
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would like some reviews on this section, not sure if I got the examples exactly correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@Lms24 Lms24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only reviewed the first example of the temporal problem section (i.e. user data):

I think the example shows the problem very well but we need to be careful if we're talking about DS decisions on the trace or on the individual transactions. To me, the explanations of the example are kinda in between both and they make it appear as if we were changing the DSC we store on the transaction after we receive the user data from user service. Which is not the case.

To my understanding, what would actually happen in this example (because of the immutability property of DSC) is

  • for trace DS decision: We would always be missing the user_id because it is not available at the first propagation (i.e. browser app -> user service via baggage). Even once the user data is available, we cannot update the DSC and hence, user data is never propagated to downstream services.
  • for individual transaction DS decision: According to the example, we know that the browser SDK puts the user data into the event that is sent to Sentry (not into the traceenv header though). Assuming that Relay looks at event.user for transaction DS decisions (which iirc it does but I might be off here), this transaction would be sampled on the user_id. The downstream transactions (user service, services A,B,C) would only be sampled by user_id, if they themselves set a user_id. They would never receive that information via baggage (in this example).

I hope this makes sense but we should rework the explanation so that a) differences are clearly visible to everyone and b) everyone is aware that the temporal problem might lead to user confusion because traces are sampled differently than potentially expected

src/docs/sdk/performance/dynamic-sampling-context.mdx Outdated Show resolved Hide resolved
src/docs/sdk/performance/dynamic-sampling-context.mdx Outdated Show resolved Hide resolved
src/docs/sdk/performance/dynamic-sampling-context.mdx Outdated Show resolved Hide resolved
src/docs/sdk/performance/dynamic-sampling-context.mdx Outdated Show resolved Hide resolved
src/docs/sdk/performance/dynamic-sampling-context.mdx Outdated Show resolved Hide resolved
@lforst lforst marked this pull request as ready for review June 24, 2022 14:28
@lforst lforst dismissed untitaker’s stale review June 24, 2022 14:30

Dismissing just to unblock. Requested change can be revisited afterwards.

Co-authored-by: Lukas Stracke <lukas@stracke.co.at>
@AbhiPrasad
Copy link
Member

AbhiPrasad commented Jun 24, 2022

@Lms24 I adjusted my wording to make it clear about the differences between trace and transaction DS, and made it clear in the temporal problem description that we were talking about trace based dynamic sampling.

for trace DS decision: We would always be missing the user_id because it is not available at the first propagation (i.e. browser app -> user service via baggage). Even once the user data is available, we cannot update the DSC and hence, user data is never propagated to downstream services.

This is correct, and essentially what I wanted to reflect. I'll borrow some of this wording to use.

I'm going to go ahead and merge this spec in the interest of unblocking all the SDK teams that are using it - though I would like to re-iterate the comment at the top of the spec:

This page is under active development.
Specifications are not final and subject to change.
Anything that sounds fishy probably is - nothing is set in stone.
Opening PRs to improve this page is therefore highly encouraged!

We can (and should be) re-visiting this as the dynamic sampling product evolves! Please continue to ask questions and leave feedback. I will create a follow up GH issue in this repo so we can track other discussion points that need to be documented like what is in #613 (comment)

Edit: GH issue here: #618

Copy link
Member

@AbhiPrasad AbhiPrasad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v1 of the spec!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Dynamic Sampling Context + Baggage Spec
8 participants