fix(txnames): Revert high threshold for running the clusterer #49087

jjbayer · 2023-05-15T10:47:14Z

As part of https://github.com/getsentry/team-ingest/issues/93, we merged #46503, to ensure we would not run the clusterer for fresh projects until they collect a high amount of unique transaction names. This was based on a suspicion that we would otherwise declare all URL transactions as sanitized prematurely.

However, we did not have any data to back up this decision, and there is no reason to impose this threshold from the algorithm's point of view: There is already the (lower) MERGE_THRESHOLD which should prevent low-quality replacement rules.

What we do know is that we've seen a decline in the number of transactions changed by clustering rules (see metric event.transaction_name_changes), which might be because we are now too strict about when we run the clusterer.
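The argument that a merge threshold alone can prevent low-quality rules can be illustrated with a minimal sketch. This is not Sentry's clusterer: `derive_rules`, the `/prefix/*/**` rule format, and the threshold value `3` are assumptions chosen for illustration. Only the name `MERGE_THRESHOLD` comes from the PR description.

```python
from collections import defaultdict

# Illustrative value only; the real MERGE_THRESHOLD lives in Sentry's clusterer.
MERGE_THRESHOLD = 3


def derive_rules(names, merge_threshold=MERGE_THRESHOLD):
    """Return wildcard rules for path positions with many distinct values.

    Hypothetical simplification: a position is replaced by `*` only when
    at least `merge_threshold` distinct segments were seen under the
    same prefix.
    """
    children = defaultdict(set)
    for name in names:
        segments = name.strip("/").split("/")
        for i, segment in enumerate(segments):
            # Record which segment values appear below each prefix.
            children[tuple(segments[:i])].add(segment)

    rules = []
    for prefix, values in children.items():
        if len(values) >= merge_threshold:
            rules.append("/" + "/".join(prefix + ("*",)) + "/**")
    return sorted(rules)


few_names = ["/users/1", "/health"]
many_names = ["/users/1", "/users/2", "/users/3", "/health"]
print(derive_rules(few_names))
print(derive_rules(many_names))
```

In this sketch a project with only a couple of names yields no rules even though the clusterer ran, so an additional "minimum number of names before running" gate would not change the outcome.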

olksdr

lgtm!

codecov · 2023-05-15T11:18:18Z

Codecov Report

Merging #49087 (14ff455) into master (aaa7d66) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #49087   +/-   ##
=======================================
  Coverage   80.92%   80.92%           
=======================================
  Files        4819     4819           
  Lines      202008   202008           
  Branches    11412    11412           
=======================================
+ Hits       163467   163470    +3     
+ Misses      38287    38284    -3     
  Partials      254      254

| Impacted Files | Coverage Δ |
| --- | --- |
| src/sentry/ingest/transaction_clusterer/tasks.py | `97.56% <100.00%> (ø)` |

... and 3 files with indirect coverage changes

iker-barriocanal

> to ensure we would not run the clusterer for fresh projects until they collect a high amount of unique transaction names. This was based on a suspicion that we would otherwise declare all URL transactions as sanitized prematurely.

At that time, I experimented with data subsets of different sizes and with different merge thresholds, and the ratio between the two clearly influences the number and quality of the rules. Given that it's hard to get rid of a rule once it has been generated, we decided to play it safe here. This conclusion was drawn from a small subset of data that isn't representative of how the algorithm behaves in general, but we should still consider the impact on rule quality.

> What we do know is that we've seen a decline in the number of transactions [...] which might be because we are now too strict about when we run the clusterer.

It could also be because rules were disappearing over time, so fewer transactions were sanitized. Once we bump the rules properly, this could no longer be a problem.

I'm approving the PR because I don't have data to tell if this is a good or wrong decision, but we should monitor it.

jjbayer · 2023-05-15T13:18:16Z

> It could also be because rules were disappearing over time, so fewer transactions were sanitized. Once we bump the rules properly, this could no longer be a problem.

That was the immediate cause, but shouldn't the clusterer have immediately rediscovered those rules after they expired?

> I'm approving the PR because I don't have data to tell if this is a good or wrong decision, but we should monitor it.

Thanks, I added some more metrics in recent days, will add them to the dashboard and configure some alerts.

Count a clusterer run whenever the clusterer spawn job is scheduled and a project has collected at least one transaction since the last run. Projects that have very few unique transaction names (i.e. low cardinality) should get those metrics tagged by the original transaction name, instead of being stuck in `<< unparameterized >>` forever. #49087 already lowered the threshold, but we're still seeing a high percentage of URL transactions not getting relabeled. ref: getsentry/team-ingest#124
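The counting condition described above can be sketched as follows. Everything here is hypothetical: `ProjectState`, its fields, and `on_spawn_scheduled` are names invented for illustration and do not reflect how Sentry stores or updates this information.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    # Hypothetical bookkeeping; not Sentry's actual storage model.
    transactions_since_last_run: int = 0
    run_count: int = 0


def on_spawn_scheduled(state: ProjectState) -> bool:
    """Count a clusterer run when the spawn job is scheduled.

    A run is counted only if the project collected at least one
    transaction since the previous run. No minimum number of unique
    names is required.
    """
    if state.transactions_since_last_run < 1:
        return False
    state.run_count += 1
    state.transactions_since_last_run = 0
    return True


state = ProjectState(transactions_since_last_run=1)
print(on_spawn_scheduled(state), state.run_count)
print(on_spawn_scheduled(state), state.run_count)
```

Under these assumptions, a low-cardinality project still accumulates counted runs as long as it keeps receiving transactions.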


jjbayer added 2 commits May 15, 2023 11:46

log

48d68bc

fix(txnames): Remove large threshold for starting clusterer

2ea1b70

github-actions bot added the Scope: Backend label May 15, 2023

jjbayer changed the title ~~Fix/txnames min names~~ fix(txnames): Revert high threshold for running the clusterer May 15, 2023

vercel bot deployed to Preview May 15, 2023 10:49

Merge branch 'master' into fix/txnames-min-names

14ff455

jjbayer marked this pull request as ready for review May 15, 2023 11:13

jjbayer requested review from a team and iker-barriocanal May 15, 2023 11:13

olksdr approved these changes May 15, 2023


vercel bot deployed to Preview May 15, 2023 11:15

iker-barriocanal approved these changes May 15, 2023


jjbayer merged commit b24987f into master May 15, 2023

jjbayer deleted the fix/txnames-min-names branch May 15, 2023 13:16

jjbayer mentioned this pull request May 23, 2023

fix(txnames): Always count clusterer run #49597

Merged

github-actions bot locked and limited conversation to collaborators May 31, 2023
