Add spec for Dynamic Sampling Context #613

lforst · 2022-06-22T09:35:45Z

Todo

Resolves #611

After merging this PR:

Open issues for all the SDKs requiring changes for flattening the user. Please point out in the changelog which minimum Relay version the customers require for dynamic sampling to work, like here for example: https://github.com/getsentry/sentry-cocoa/blob/master/CHANGELOG.md#7130

marandaneto · 2022-06-22T09:59:52Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+Until now, traces sampling was only done through a `sample_rate` option in the SDKs.
+This has quite a few drawbacks for users of Sentry SDKs:


Isn't it tracesSampleRate, if it's for traces?
sampleRate is only for events.

Yes sir you are correct. Thanks for pointing this out. Fixed in 6cdc569.

Nice, I'd look up the usage of this name in the docs tho, there are more references to sample_rate or sampleRate, such as sentry-samplerate.

Good point. I changed it to camel case and added links to the relevant sections in the docs (where it is also in camel case): 1a27b85

adinauer · 2022-06-22T10:14:37Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+If a `baggage` header already exists on an outgoing request, SDKs should aim to be good citizens by only **appending** Sentry values to the header.
+In the case that another vendor added Sentry values to an outgoing request, SDKs may overwrite those values.
+
+SDKs must not add other vendors' baggage from incoming requests to outgoing requests.


So devs have to add incoming baggage headers to outgoing requests themselves. If it's already there we just add our list-items to the outgoing baggage header(s). If no baggage header is present we add one with our list-items if our options tell us to do so, correct?

Yes correct, we already discussed this quite extensively within the JS SDK team. If users want to propagate baggage in general, they can propagate it themselves (or use libraries specifically for that). Sentry SDKs should only propagate sentry-* entries in the baggage header.

Mental model: Sentry SDKs are not trace-context/baggage propagation libraries - they are Sentry SDKs.

If anybody has strong opinions against this however, we can reopen the discussion on this.
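
For illustration, the "only append, never touch other vendors' values" rule discussed above could be sketched roughly as follows. This is a minimal sketch, not the SDK implementation: the function name `merge_outgoing_baggage` and its arguments are hypothetical, and percent-encoding the values is an assumption based on the W3C Baggage format.

```python
from urllib.parse import quote

SENTRY_PREFIX = "sentry-"


def merge_outgoing_baggage(existing_header, sentry_items):
    """Build the outgoing baggage value (hypothetical helper).

    Third-party list members are kept verbatim and in order. Any sentry-*
    members already on the outgoing request are dropped and replaced by
    `sentry_items`, since the spec text allows overwriting those.
    """
    kept = []
    for member in (existing_header or "").split(","):
        member = member.strip()
        if not member:
            continue
        key = member.split("=", 1)[0].strip()
        if key.startswith(SENTRY_PREFIX):
            continue  # stale Sentry value set by someone else: overwrite it
        kept.append(member)  # other vendors' members stay untouched

    # Assumption: values are percent-encoded as in the W3C Baggage format.
    ours = [f"{key}={quote(str(value), safe='')}" for key, value in sentry_items.items()]
    return ",".join(kept + ours)
```

Note that this only looks at a header that is already present on the outgoing request. Copying third-party members from an incoming request onto outgoing requests is deliberately left out, in line with the mental model above.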

adinauer · 2022-06-22T10:15:54Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+- `sentry-transaction` - The name of the trace's origin transaction in unparameterized (raw) format
+- `sentry-userid` - User ID as set by the user with <Link to="/sdk/unified-api/#scope">`scope.set_user`</Link>
+- `sentry-usersegment` - User segment as set by the user with <Link to="/sdk/unified-api/#scope">`scope.set_user`</Link>
+- `sentry-samplerate` - Sample rate as defined by the user in the SDK options


How should the sample rate be formatted? Do we have a maximum number of decimal places?

Good question. In the JS SDK, we're currently just calling toString on the sample rate number, which by default caps it at 16 decimal places. Happy to change it if we decide to handle this uniformly across SDKs.

Gave this a little more thought and as far as I can tell, the head SDK should set and propagate/send a format that in the end can be parsed by Relay without problems. I see no reason why downstream SDKs would have to convert a received sample rate string to a number, so we can probably disregard language specific concerns other than Rust.

Which brings me to the question of what the constraints on the Relay side are (@jjbayer, any thoughts?). Should we, for example, agree on only sending/propagating the sample rate in "simple" decimal notation (i.e. no exponential notation such as 1.45e10-14), as proposed by @lforst?

EDIT: @lforst just notified me that this was already discussed and we agreed on the proposal
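
As a rough illustration of what "simple decimal notation" could mean in practice, here is a small Python sketch. The helper name `format_sample_rate` is made up for this example, and the range check is an assumption; the thread only agrees on avoiding exponential notation.

```python
from decimal import Decimal


def format_sample_rate(rate):
    """Format a sample rate as a plain decimal string without an exponent.

    Hypothetical helper: str() on a small float such as 1e-05 would yield
    exponential notation, so the value is routed through Decimal and
    rendered with fixed-point formatting instead.
    """
    if not 0.0 <= rate <= 1.0:
        # Assumption: sample rates are probabilities between 0 and 1.
        raise ValueError(f"sample rate out of range: {rate!r}")
    return format(Decimal(repr(rate)), "f")
```

For example, `format_sample_rate(1e-05)` gives `"0.00001"` rather than `"1e-05"`, which keeps the head SDK's output trivially parseable downstream.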

adinauer · 2022-06-22T10:23:59Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+
+Todo:
+
+- Why baggage and not trace context https://www.w3.org/TR/trace-context/?


Trace context is more restrictive in terms of size than baggage. Also baggage is more flexible in terms of encoding and characters according to our internal document on the decision.

adinauer · 2022-06-22T10:46:14Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+### Envelope Header
+
+Dynamic Sampling Context is transferred to Sentry through the <Link to="/sdk/envelopes/#transaction">transaction envelope headers</Link>, keyed by `trace`.
+It corresponds directly to the definition of <Link to="/sdk/performance/trace-context/#trace-context">Trace Context</Link>.


What's the urgency on renaming TraceContext to Dynamic Sampling Context for SDKs?

I think those are still more or less different concepts. The word Dynamic Sampling Context represents the sentry-* values we propagate in the baggage header. The term TraceContext is incredibly overloaded - in the docs I just wrote it simply refers to the object schema over at https://develop.sentry.dev/sdk/performance/trace-context/#trace-context

I know in SDKs there are some fields called, for example, event.contexts.trace, but those fields are something else again. Right now we only care about Dynamic Sampling Context propagation in baggage and TraceContext in the event envelope header.

Edit: Lukas provided a better answer down below

I was referring to this comment #611 (comment) and comments following it.

Sorry to jump in here, but what we proposed was to find a good term for the data we want to both propagate to downstream SDKs (i.e. via the baggage HTTP header) and send to Relay (via the trace envelope header). That data should be the same; however, it is structured differently (as pointed out in the docs added by this PR).

The reasons for introducing "Dynamic Sampling Context" (DSC) were already mentioned: Trace Context is an overloaded term, we observed a lot of confusion around terms like "baggage", "trace state", "trace context", etc. and DSC should aim to unify this as much as possible. We're aware that Relay is currently calling the trace envelope header TraceContext. IMO it's not as important to change this right now but to rather have a term we can all agree on when talking about what we actually propagate/send.

Discussed this with @lforst and we agree on that

So DSC is what we use internally to transport tracing information to the outgoing request (be that an API call or an envelope sent to Sentry), either from an incoming request, if one exists, or from the place where the tracing information is created in the SDK that's first in line.

Would it make sense to find a name for the "first in line" SDK to more easily refer to it? e.g. head SDK?

We propose that DSC is a term we use to describe the bag of key-value pairs which are propagated (baggage) and sent to relay (envelope header). So DSC contains:

the three "internal" items (trace id, public key, sample rate)

the five "external" items (environment, release, transaction name, user id, user segment)

So essentially, see it as a "meta interface" describing the stuff we propagate/send.

Maybe I misunderstood the question but it doesn't really have anything to do with the head transaction. Meaning, an SDK would get DSC via an incoming baggage header (if there exists one/the SDK is not the head SDK)

Would it make sense to find a name for the "first in line" SDK to more easily refer to it? e.g. head SDK?

Yes, very much in favour of "head SDK", "head transaction", etc. (as in "head" of trace)
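
To make the "meta interface" idea a bit more concrete, here is a hedged Python sketch of DSC as one immutable bag of values with two serializations. The class and method names are hypothetical. The `sentry-userid`, `sentry-usersegment`, `sentry-samplerate` and `sentry-transaction` keys come from the spec text quoted in this PR; the remaining baggage keys and all envelope header field names are assumptions derived from the same naming pattern.

```python
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import quote


@dataclass(frozen=True)  # frozen: DSC must not change once it exists
class DynamicSamplingContext:
    # The three "internal" items
    trace_id: str
    public_key: str
    sample_rate: str
    # The five "external" items (all optional)
    environment: Optional[str] = None
    release: Optional[str] = None
    transaction: Optional[str] = None
    user_id: Optional[str] = None
    user_segment: Optional[str] = None

    def _present(self):
        # Only values that are actually set get propagated or sent.
        return {k: v for k, v in asdict(self).items() if v is not None}

    def to_envelope_trace_header(self):
        """Flat dict for the `trace` envelope header (field names assumed)."""
        return self._present()

    def to_baggage(self):
        """Baggage list members, e.g. user_id -> sentry-userid."""
        return ",".join(
            f"sentry-{name.replace('_', '')}={quote(value, safe='')}"
            for name, value in self._present().items()
        )
```

Both outputs are derived from the same object, which mirrors the point made above: the data is identical, only the structure differs between the baggage header and the envelope header.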

adinauer · 2022-06-22T11:03:37Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+<Alert level="info">
+
+Other vendors might also be using the `baggage` header.
+If a `baggage` header already exists on an outgoing request, SDKs should aim to be good citizens by only **appending** Sentry values to the header.


For incoming baggage that already contains sentry- list-items:

We do not modify existing ones as we want to keep them the way they were sent by the first SDK in line, right?

Does that also mean we do not add missing list-items? Assuming userid was not set by the first SDK but another SDK knows it, should it add the list-item to baggage and other places or keep them the way they were sent by the first SDK?

As Dynamic Sampling Context is immutable across an entire trace, we cannot add additional sentry- list items when there already are ones on an incoming request. The reasoning for immutability is explained in the #ingest section of this doc - essentially relay needs to make exactly the same sampling decision for all individual transactions of a trace. This requires all transactions to have the exact same Dynamic Sampling Context.

I added the pseudo algorithm of how SDKs should instrument incoming and outgoing requests in regards to DSC. I hope that clarifies this a bit: 3f13024

Also added some sentences to explain this a little bit more explicitly: ec30942

Let’s also make it very clear that we should not delete or change existing non-sentry baggage values (and this is very, very important).

@AbhiPrasad

we should not delete or change existing non-sentry baggage values

Does this mean that if an incoming baggage header is close to the limit of 8192 characters we are not allowed to add our sentry list-items to the header? Does that mean we should add them in an all or nothing fashion or do we have a priority for which of them we want to try and add until we run out of characters?

Does this mean that if an incoming baggage header is close to the limit of 8192 characters we are not allowed to add our sentry list-items to the header?

Yes we need to be good citizens and respect the standard here.

Does that mean we should add them in an all or nothing fashion or do we have a priority for which of them we want to try and add until we run out of characters?

I don't think we have a priority, we try to add what we can. I'd stay away from the all-or-nothing approach since it seems like it might be confusing to users (not sure though). I'm comfortable enough with this for now since the vast majority of head SDKs (those creating the head transactions of a trace) will be browser/mobile, which should not have problems with incoming baggage - only outgoing. As a result, we won't really hit this to start, and if it ever becomes a substantial problem, I think we can come back and revisit. For now, let's just optimize for getting this out the door.

I don't think we have a priority, we try to add what we can

I agree - let's try to keep things simple for now. In case this really becomes a problem, we could still revisit this and discuss priorities of keys or other handling strategies.

For now, if we exceed the max length, we should log a warning though so that users/we are aware of why DSC might be propagated/sent incomplete in this case.
(Which is what we're currently doing in the JS SDKs).
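
A minimal Python sketch of the "add what we can, then warn" behaviour described above might look like this. The function name, the logger name and the choice to keep trying smaller members after one does not fit are all assumptions; the 8192 limit is the figure mentioned in this thread and is counted in characters here, as in the question.

```python
import logging

logger = logging.getLogger("dsc_example")  # hypothetical logger name

MAX_BAGGAGE_LENGTH = 8192  # limit mentioned in the discussion above


def append_within_limit(existing_header, sentry_members, limit=MAX_BAGGAGE_LENGTH):
    """Append sentry-* list members while the header stays within `limit`.

    Existing (third-party) content is never shortened or altered. Members
    that do not fit are skipped, and a warning is logged so users can see
    why DSC may be incomplete.
    """
    header = existing_header or ""
    dropped = []
    for member in sentry_members:
        candidate = f"{header},{member}" if header else member
        if len(candidate) <= limit:
            header = candidate
        else:
            dropped.append(member)

    if dropped:
        logger.warning(
            "Baggage header limit of %d reached; dropped %d Sentry item(s): %s",
            limit,
            len(dropped),
            ", ".join(m.split("=", 1)[0] for m in dropped),
        )
    return header, dropped
```

There is no priority order between keys in this sketch, matching the "keep it simple for now" decision; a priority list could be introduced later by sorting `sentry_members` before the loop.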

Co-authored-by: Alexander Dinauer <adinauer@users.noreply.github.com>

adinauer · 2022-06-22T12:27:32Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+    # --> we don't propagate baggage for this trace
+    transaction.baggage_locked = true
+    transaction.baggage = {}
+  elif has_header(request, "baggage") and has_sentry_value_in_baggage_header(request):


How does has_sentry_value_in_baggage_header work?

Is there a specific list of list-items we need to be present for dynamic sampling to work?
I assume samplerate and publickey are required, are other fields as well?
Some of them are optional.

How does has_sentry_value_in_baggage_header work?

Added a pseudo implementation of the function to clarify: 0513094 - We simply check if there is a key in the baggage header that starts with "sentry-" or not.

Is there a specific list of list-items we need to be present for dynamic sampling to work?

There is no list yet, but I tried to clarify in 4a7508f. I believe for DS we need samplerate, publickey and traceid. However, I should note that right here we don't care at all about what values are actually needed. If something is missing in an incoming request, we propagate DSC as is, and don't try to add stuff. I again wanna put emphasis on the fact that DSC cannot be mutated (by the origin application or any application down the line) as soon as it has been propagated.
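
For readers who want something runnable, the prefix check and the "propagate as is" rule could be sketched in Python as below. This is not the pseudocode from 0513094: the functions here take the raw header string instead of a request object, `parse_baggage` and `dsc_from_incoming` are hypothetical helpers, and the other branches of the spec's algorithm are not covered.

```python
def parse_baggage(header):
    """Split a baggage header into {key: value}, ignoring member properties."""
    items = {}
    for member in (header or "").split(","):
        key, sep, value = member.partition("=")
        if sep:
            # Drop ";property" suffixes; values are kept verbatim otherwise.
            items[key.strip()] = value.split(";", 1)[0].strip()
    return items


def has_sentry_value_in_baggage_header(header):
    """True if any baggage key starts with "sentry-"."""
    return any(key.startswith("sentry-") for key in parse_baggage(header))


def dsc_from_incoming(header):
    """Return (dsc, frozen) for an incoming baggage header.

    If the header carries any sentry-* value, those values are taken over
    unchanged and DSC is frozen: nothing is added, even if items such as
    the user ID are missing. Otherwise DSC is empty and still mutable.
    """
    if has_sentry_value_in_baggage_header(header):
        dsc = {k: v for k, v in parse_baggage(header).items() if k.startswith("sentry-")}
        return dsc, True
    return {}, False
```

The `frozen` flag is what stops a downstream SDK from "helpfully" filling in missing items, which would break the requirement that every transaction in a trace carries the same DSC.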

AbhiPrasad

Great start, this is awesome @lforst!

AbhiPrasad · 2022-06-22T12:20:59Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+This has quite a few drawbacks for users of Sentry SDKs:
+
+- Changing the sampling rate involved either redeploying applications (which is problematic in case of applications that are not updated automatically, i.e., mobile apps or physically distributed software) or building complex systems to dynamically fetch a sampling rate.
+- Sampling only happened based on a factor of randomness.


Sampling is head-based probability sampling (in this case, simple random sampling).

Do you have a suggestion on how to reword this? I am kinda lost.

Let me think about this and try to push up a commit later today.

AbhiPrasad · 2022-06-22T12:24:41Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+<Alert level="info">
+
+Other vendors might also be using the `baggage` header.
+If a `baggage` header already exists on an outgoing request, SDKs should aim to be good citizens by only **appending** Sentry values to the header.


Let’s also make it very clear that we should not delete or change existing non-sentry baggage values (and this is very, very important).

AbhiPrasad · 2022-06-23T18:43:37Z

src/docs/sdk/performance/dynamic-sampling-context.mdx

+
+This section details some open questions and considerations that need to be addressed for dynamic sampling and the usage of the baggage propagation mechanism. These are not blockers to the adoption of the spec, but instead are here as context for future developments of the dynamic sampling product and spec.
+
+### The Temporal Problem


Would like some reviews on this section, not sure if I got the examples exactly correct.

#613 (review)

Lms24

Only reviewed the first example of the temporal problem section (i.e. user data):

I think the example shows the problem very well but we need to be careful if we're talking about DS decisions on the trace or on the individual transactions. To me, the explanations of the example are kinda in between both and they make it appear as if we were changing the DSC we store on the transaction after we receive the user data from user service. Which is not the case.

To my understanding, what would actually happen in this example (because of the immutability property of DSC) is:

- for trace DS decision: We would always be missing the user_id because it is not available at the first propagation (i.e. browser app -> user service via baggage). Even once the user data is available, we cannot update the DSC and hence, user data is never propagated to downstream services.
- for individual transaction DS decision: According to the example, we know that the browser SDK puts the user data into the event that is sent to Sentry (not into the trace envelope header though). Assuming that Relay looks at event.user for transaction DS decisions (which iirc it does but I might be off here), this transaction would be sampled on the user_id. The downstream transactions (user service, services A, B, C) would only be sampled by user_id if they themselves set a user_id. They would never receive that information via baggage (in this example).

I hope this makes sense but we should rework the explanation so that a) differences are clearly visible to everyone and b) everyone is aware that the temporal problem might lead to user confusion because traces are sampled differently than potentially expected

Dismissing just to unblock. Requested change can be revisited afterwards.

Co-authored-by: Lukas Stracke <lukas@stracke.co.at>

AbhiPrasad · 2022-06-24T20:14:10Z

@Lms24 I adjusted my wording to make the differences between trace and transaction DS clear, and made it clear in the temporal problem description that we are talking about trace-based dynamic sampling.

for trace DS decision: We would always be missing the user_id because it is not available at the first propagation (i.e. browser app -> user service via baggage). Even once the user data is available, we cannot update the DSC and hence, user data is never propagated to downstream services.

This is correct, and essentially what I wanted to reflect. I'll borrow some of this wording to use.

I'm going to go ahead and merge this spec in the interest of unblocking all the SDK teams that are using it - though I would like to re-iterate the comment at the top of the spec:

This page is under active development.
Specifications are not final and subject to change.
Anything that sounds fishy probably is - nothing is set in stone.
Opening PRs to improve this page is therefore highly encouraged!

We can (and should) revisit this as the dynamic sampling product evolves! Please continue to ask questions and leave feedback. I will create a follow-up GH issue in this repo so we can track other discussion points that need to be documented, like what is in #613 (comment).

Edit: GH issue here: #618

AbhiPrasad

v1 of the spec!

Add spec for Dynamic Sampling Context

503933e

lforst mentioned this pull request Jun 22, 2022

Add Dynamic Sampling Context + Baggage Spec #611

Closed

marandaneto reviewed Jun 22, 2022

marandaneto requested review from adinauer, bruno-garcia and brustolin June 22, 2022 10:00

adinauer reviewed Jun 22, 2022

adinauer reviewed Jun 22, 2022

Add Unified Propagation Mechanism

3f13024

adinauer reviewed Jun 22, 2022

adinauer reviewed Jun 22, 2022

Change sample_rate to traces_sample_rate

6cdc569

adinauer reviewed Jun 22, 2022

Change "delimiter" to "prefix"

082ca1b

Co-authored-by: Alexander Dinauer <adinauer@users.noreply.github.com>

Clarify that DSC cannot be altered after first propagation

ec30942

Specify that we're using the trace envelope header

30d630c

adinauer reviewed Jun 22, 2022

AbhiPrasad reviewed Jun 22, 2022

Clarify functionality of has_sentry_value_in_baggage_header

0513094

Lms24 mentioned this pull request Jun 22, 2022

feat(tracing): Add additional Dynamic Sampling Context items to baggage and envelope headers getsentry/sentry-javascript#5292

Merged

6 tasks

Lms24 mentioned this pull request Jun 23, 2022

[DSC] Unify baggage and envelope header structure getsentry/sentry-javascript#5301

Closed

lforst added 2 commits June 23, 2022 16:46

Clarify format of sample_rate

85be8e7

Remove Trace Context docs

0c983c0

add Considerations and Challenges section

2fc84fd

AbhiPrasad reviewed Jun 23, 2022

remove baggage todo

5facd1e

Put sentences in separate lines for better diffing

c2ec13e

Lms24 reviewed Jun 24, 2022

adinauer mentioned this pull request Jun 24, 2022

Add sample rate to baggage as well as trace in envelope header and flatten user getsentry/sentry-java#2135

Merged

4 tasks

Stress importance of "freezing" DSC

cbf805f

This was referenced Jun 24, 2022

Dynamic Sampling baggage header changes getsentry/sentry-mobile-release-health-app#241

Closed

Dynamic Sampling baggage header changes getsentry/team-mobile#23

Closed

lforst marked this pull request as ready for review June 24, 2022 14:28

Apply suggestions from code review

ee324c5

Co-authored-by: Lukas Stracke <lukas@stracke.co.at>

Clarify trace behaviour in Temporal Problem

700e020

AbhiPrasad approved these changes Jun 24, 2022

AbhiPrasad enabled auto-merge (squash) June 24, 2022 20:15

AbhiPrasad merged commit fbb25c6 into master Jun 24, 2022

AbhiPrasad deleted the lforst-dynamic-sampling-context branch June 24, 2022 20:16

AbhiPrasad mentioned this pull request Jun 24, 2022

[DS] Document dynamic sampling questions and considerations #618

Open

4 tasks

marandaneto mentioned this pull request Jun 27, 2022

Bump Sentry JavaScript 7.3.1 getsentry/sentry-react-native#2306

Merged

8 tasks
