Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(txnames): Revert high threshold for running the clusterer #49087

Merged
merged 3 commits into from
May 15, 2023

Conversation

jjbayer
Copy link
Member

@jjbayer jjbayer commented May 15, 2023

As part of https://github.com/getsentry/team-ingest/issues/93, we merged #46503, to ensure we would not run the clusterer for fresh projects until they collect a high amount of unique transaction names. This was based on a suspicion that we would otherwise declare all URL transactions as sanitized prematurely.

However, we did not have any data to back up this decision, and there is no reason to impose this threshold from the algorithm's point of view: There is already the (lower) MERGE_THRESHOLD which should prevent low-quality replacement rules.

What we do know is that we've seen a decline in the number of transactions changed by clustering rules (see metric event.transaction_name_changes), which might be because we are now too strict about when we run the clusterer.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label May 15, 2023
@jjbayer jjbayer changed the title Fix/txnames min names fix(txnames): Revert high threshold for running the clusterer May 15, 2023
@jjbayer jjbayer marked this pull request as ready for review May 15, 2023 11:13
@jjbayer jjbayer requested review from a team and iker-barriocanal May 15, 2023 11:13
Copy link
Contributor

@olksdr olksdr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@codecov
Copy link

codecov bot commented May 15, 2023

Codecov Report

Merging #49087 (14ff455) into master (aaa7d66) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #49087   +/-   ##
=======================================
  Coverage   80.92%   80.92%           
=======================================
  Files        4819     4819           
  Lines      202008   202008           
  Branches    11412    11412           
=======================================
+ Hits       163467   163470    +3     
+ Misses      38287    38284    -3     
  Partials      254      254           
Impacted Files Coverage Δ
src/sentry/ingest/transaction_clusterer/tasks.py 97.56% <100.00%> (ø)

... and 3 files with indirect coverage changes

Copy link
Contributor

@iker-barriocanal iker-barriocanal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to ensure we would not run the clusterer for fresh projects until they collect a high amount of unique transaction names. This was based on a suspicion that we would otherwise declare all URL transactions as sanitized prematurely.

At that time, I played around with a subset of data of different sizes and different merge thresholds, and the ratio definitely influences the amount and quality of the rules. Given it's hard to get rid of a rule once generated, we decided to play safe here. This conclusion was taken from a small subset of data that isn't representative of the algorithm, but we should consider the impact on the quality.

What we do know is that we've seen a decline in the number of transactions [...] which might be because we are now too strict about when we run the clusterer.

It could also be because rules were disappearing over time, so fewer transactions were sanitized. Once we bump the rules properly, this could no longer be a problem.

I'm approving the PR because I don't have data to tell if this is a good or wrong decision, but we should monitor it.

@jjbayer jjbayer merged commit b24987f into master May 15, 2023
@jjbayer jjbayer deleted the fix/txnames-min-names branch May 15, 2023 13:16
@jjbayer
Copy link
Member Author

jjbayer commented May 15, 2023

It could also be because rules were disappearing over time, so fewer transactions were sanitized. Once we bump the rules properly, this could no longer be a problem.

That was the immediate cause, but shouldn't the cluster have immediately rediscovered those rules after they expired?

I'm approving the PR because I don't have data to tell if this is a good or wrong decision, but we should monitor it.

Thanks, I added some more metrics in recent days, will add them to the dashboard and configure some alerts.

jjbayer added a commit that referenced this pull request May 24, 2023
Count a clusterer run whenever the clusterer spawn job is scheduled and
a project has collected at least one transaction since the last run.

Projects that have very few unique transaction names (i.e. low
cardinality) should get those metrics tagged by the original transaction
name, instead of being stuck in `<< unparameterized >>` forever.

#49087 already lowered the
threshold, but we're still seeing a high percentage of URL transactions
not getting relabeled.

ref: getsentry/team-ingest#124
volokluev pushed a commit that referenced this pull request May 30, 2023
As part of getsentry/team-ingest#93, we merged
#46503, to ensure we would not
run the clusterer for fresh projects until they collect a high amount of
unique transaction names. This was based on a suspicion that we would
otherwise declare all URL transactions as sanitized prematurely.

However, we did not have any data to back up this decision, and there is
no reason to impose this threshold from the algorithm's point of view:
There is already the (lower) `MERGE_THRESHOLD` which should prevent
low-quality replacement rules.

What we _do_ know is that we've seen a decline in the number of
transactions changed by clustering rules (see metric
`event.transaction_name_changes`), which might be because we are now too
strict about when we run the clusterer.
volokluev pushed a commit that referenced this pull request May 30, 2023
Count a clusterer run whenever the clusterer spawn job is scheduled and
a project has collected at least one transaction since the last run.

Projects that have very few unique transaction names (i.e. low
cardinality) should get those metrics tagged by the original transaction
name, instead of being stuck in `<< unparameterized >>` forever.

#49087 already lowered the
threshold, but we're still seeing a high percentage of URL transactions
not getting relabeled.

ref: getsentry/team-ingest#124
@github-actions github-actions bot locked and limited conversation to collaborators May 31, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants