fix(txnames): Revert high threshold for running the clusterer #49087
Conversation
lgtm!
Codecov Report
@@            Coverage Diff            @@
##           master   #49087   +/-   ##
=======================================
  Coverage   80.92%   80.92%
=======================================
  Files        4819     4819
  Lines      202008   202008
  Branches    11412    11412
=======================================
+ Hits       163467   163470       +3
+ Misses      38287    38284       -3
  Partials      254      254
> …to ensure we would not run the clusterer for fresh projects until they collect a high amount of unique transaction names. This was based on a suspicion that we would otherwise declare all URL transactions as sanitized prematurely.
At that time, I played around with subsets of data of different sizes and different merge thresholds, and the ratio definitely influences the number and quality of the rules. Given that it's hard to get rid of a rule once it has been generated, we decided to play it safe here. That conclusion was drawn from a small, not necessarily representative subset of data, so we should keep an eye on the impact on rule quality.
What we do know is that we've seen a decline in the number of transactions [...] which might be because we are now too strict about when we run the clusterer.
It could also be because rules were disappearing over time, so fewer transactions were sanitized. Once we bump the rules' lifetimes properly, this should no longer be a problem.
I'm approving the PR because I don't have data to tell whether this is a good or a bad decision, but we should monitor it.
That was the immediate cause, but shouldn't the clusterer have immediately rediscovered those rules after they expired?
Thanks, I added some more metrics in recent days; I'll add them to the dashboard and configure some alerts.
Count a clusterer run whenever the clusterer spawn job is scheduled and a project has collected at least one transaction since the last run. Projects that have very few unique transaction names (i.e. low cardinality) should get those metrics tagged by the original transaction name, instead of being stuck in `<< unparameterized >>` forever. #49087 already lowered the threshold, but we're still seeing a high percentage of URL transactions not getting relabeled. ref: getsentry/team-ingest#124
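The low-cardinality tagging rule described above can be sketched roughly as follows. This is an illustrative sketch, not Sentry's actual implementation: the function name `metric_tag`, the `LOW_CARDINALITY_LIMIT` value, and the parameter names are all assumptions made for this example.

```python
# Hypothetical sketch: keep the original transaction name in metrics for
# low-cardinality projects instead of collapsing it to "<< unparameterized >>".

LOW_CARDINALITY_LIMIT = 100  # assumed cutoff, not taken from the real code


def metric_tag(transaction_name: str, unique_name_count: int,
               is_sanitized: bool) -> str:
    """Choose the transaction tag to emit with metrics."""
    if is_sanitized:
        # Already parameterized (e.g. by a clustering rule): tag as-is.
        return transaction_name
    if unique_name_count <= LOW_CARDINALITY_LIMIT:
        # Few distinct names in this project: tagging by the raw name
        # cannot explode metric cardinality, so keep it readable.
        return transaction_name
    # High cardinality and not yet sanitized: fall back to the placeholder.
    return "<< unparameterized >>"
```

Under this sketch, a project with only a dozen distinct names keeps its raw names in metrics, while a high-cardinality project still gets the placeholder until the clusterer produces rules.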
As part of getsentry/team-ingest#93, we merged #46503, to ensure we would not run the clusterer for fresh projects until they collect a high amount of unique transaction names. This was based on a suspicion that we would otherwise declare all URL transactions as sanitized prematurely. However, we did not have any data to back up this decision, and there is no reason to impose this threshold from the algorithm's point of view: There is already the (lower) `MERGE_THRESHOLD` which should prevent low-quality replacement rules. What we _do_ know is that we've seen a decline in the number of transactions changed by clustering rules (see metric `event.transaction_name_changes`), which might be because we are now too strict about when we run the clusterer.
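The gating logic being reverted can be sketched as a simple threshold check. This is a hedged illustration only: the function name and the threshold values below are assumptions, not the actual constants in Sentry's code, and `MERGE_THRESHOLD` here is shown only as a separate, lower bound on rule quality.

```python
# Hypothetical sketch of the gate discussed in this PR: only run the
# clusterer once a project has collected "enough" unique transaction names.

MERGE_THRESHOLD = 200        # assumed: separate bound guarding rule quality
RUN_THRESHOLD_HIGH = 10_000  # assumed: the high gate this PR reverts
RUN_THRESHOLD_REVERTED = 0   # after the revert: run whenever names exist


def should_run_clusterer(unique_name_count: int, run_threshold: int) -> bool:
    """Return True if the project has collected enough unique transaction
    names for a clusterer run to be scheduled."""
    return unique_name_count > run_threshold
```

With the high gate, a fresh project with a few hundred URL-like names would be skipped indefinitely; after the revert it is clustered right away, relying on `MERGE_THRESHOLD` alone to prevent low-quality replacement rules.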