Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ref(metrics): Add normalization and update set metrics hashing #658

Merged
merged 7 commits into from
May 27, 2024

Conversation

elramen
Copy link
Contributor

@elramen elramen commented May 23, 2024

Add metrics normalization in accordance with the metrics developer docs.
Add metric name, unit, and tag truncation to adhere to the metrics user docs.
Add to_envelope for a single Metric instance to facilitate sending metrics from sentry-cli.
Change hash function to crc32 as described here, this ensures compatibility between different SDKs using the same metric.
Use clone_from instead of clone and clone_into instead of to_owned in a couple of places as suggested by clippy.

Copy link
Member

@Swatinem Swatinem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks correct and has good test coverage, so no problem there.

My main concerns are:

  • performance, quite a bit so
  • dependencies

At the very least, you should follow the performance best practices of the regex crate here: https://docs.rs/regex/latest/regex/#avoid-re-compiling-regexes-especially-in-a-loop

I would consider that a blocker, as you are compiling the same regex over and over again in every single call.

Both for performance and dependencies: Do we really care that tags are limited to N graphemes? Why not unicode chars or even bytes? As in a bunch of cases (keys mainly?) we are filtering for ASCII chars anyway, iterating over graphemes is absolute overkill.

Otherwise, there is a bunch of String allocation happening all over the place which might be avoidable. Those are just a drop in the bucket compared to the regex topic mentioned above, so it might not be worth micro-optimizing the last drop of performance out of this.

Speaking of micro-optimization, this very much reminds me of the work I did recently in rust-lang/rust#121150 which I see is about to land today 🎉

sentry-core/src/metrics/mod.rs Show resolved Hide resolved
sentry-types/src/protocol/envelope.rs Outdated Show resolved Hide resolved
sentry-core/src/metrics/mod.rs Outdated Show resolved Hide resolved
@elramen
Copy link
Contributor Author

elramen commented May 24, 2024

The code looks correct and has good test coverage, so no problem there.

My main concerns are:

  • performance, quite a bit so
  • dependencies

At the very least, you should follow the performance best practices of the regex crate here: https://docs.rs/regex/latest/regex/#avoid-re-compiling-regexes-especially-in-a-loop

I would consider that a blocker, as you are compiling the same regex over and over again in every single call.

Both for performance and dependencies: Do we really care that tags are limited to N graphemes? Why not unicode chars or even bytes? As in a bunch of cases (keys mainly?) we are filtering for ASCII chars anyway, iterating over graphemes is absolute overkill.

Otherwise, there is a bunch of String allocation happening all over the place which might be avoidable. Those are just a drop in the bucket compared to the regex topic mentioned above, so it might not be worth micro-optimizing the last drop of performance out of this.

Speaking of micro-optimization, this very much reminds me of the work I did recently in rust-lang/rust#121150 which I see is about to land today 🎉

Thanks @Swatinem! I will optimize the regex usage. The graphemes can be removed for tag keys since, as you mentioned, we already filter away multi-byte characters. But for tag values, which allow the full UTF-8 character range, the code will panic if we truncate the tag value in the middle of a multi-byte character. Do you have any suggestion for how to truncate the tag value safely without graphemes? Maybe we can catch the panic and then reduce the truncation with 1 byte until it works?

@Swatinem
Copy link
Member

"catching panics" is not really a thing because people can (and often do) use panic = "abort", not to mention that it also has perf overhead.

floor/ceil_char_boundary exists in theory but is nightly only:
https://doc.rust-lang.org/std/primitive.str.html#method.floor_char_boundary
The docs also contain a nice example related to graphemes.

Depending on what our goal here is (bytes, chars or graphemes), you might as well just copy over the underlying implementation from std, or just iterate over chars() which I believe might be the simplest solution here.

@elramen
Copy link
Contributor Author

elramen commented May 24, 2024

"catching panics" is not really a thing because people can (and often do) use panic = "abort", not to mention that it also has perf overhead.

floor/ceil_char_boundary exists in theory but is nightly only: https://doc.rust-lang.org/std/primitive.str.html#method.floor_char_boundary The docs also contain a nice example related to graphemes.

Depending on what our goal here is (bytes, chars or graphemes), you might as well just copy over the underlying implementation from std, or just iterate over chars() which I believe might be the simplest solution here.

Great thanks!

Copy link
Member

@sl0thentr0py sl0thentr0py left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, but please wait for Arpad's approve to merge and release

@elramen elramen force-pushed the metrics-normalization branch 2 times, most recently from c08c0b4 to c8c0aeb Compare May 27, 2024 12:22
Copy link
Contributor

@loewenheim loewenheim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good. I have one criticism, and it's admittedly nitpicky: I don't think the From impls on NormalizedName/Tags/Unit are appropriate. Specifically, they are neither lossless nor value-preserving, as per https://doc.rust-lang.org/std/convert/trait.From.html#when-to-implement-from. I think free functions normalize_name/tags/unit exported from the normalization module would be better here.

@elramen elramen force-pushed the metrics-normalization branch 2 times, most recently from 90d0182 to 77f1a42 Compare May 27, 2024 12:39
@elramen
Copy link
Contributor Author

elramen commented May 27, 2024

@Swatinem Fixed the regex optimization, removed graphemes, and reduced the number of new string allocations 👍

@elramen
Copy link
Contributor Author

elramen commented May 27, 2024

Looks very good. I have one criticism, and it's admittedly nitpicky: I don't think the From impls on NormalizedName/Tags/Unit are appropriate. Specifically, they are neither lossless nor value-preserving, as per https://doc.rust-lang.org/std/convert/trait.From.html#when-to-implement-from. I think free functions normalize_name/tags/unit exported from the normalization module would be better here.

@loewenheim On it!

Copy link

codecov bot commented May 27, 2024

Codecov Report

Attention: Patch coverage is 89.57219% with 39 lines in your changes are missing coverage. Please review.

Project coverage is 73.56%. Comparing base (6b83faa) to head (26e4893).
Report is 5 commits behind head on master.

Current head 26e4893 differs from pull request most recent head f3e7a57

Please upload reports for the commit f3e7a57 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #658      +/-   ##
==========================================
+ Coverage   73.09%   73.56%   +0.46%     
==========================================
  Files          62       66       +4     
  Lines        7448     7727     +279     
==========================================
+ Hits         5444     5684     +240     
- Misses       2004     2043      +39     

@elramen
Copy link
Contributor Author

elramen commented May 27, 2024

@loewenheim I removed the name and unit structs and just use a function instead now. For the tags, I export a function but keep the struct. Does this look ok?

@@ -31,9 +31,12 @@ UNSTABLE_cadence = ["dep:cadence", "UNSTABLE_metrics"]

[dependencies]
cadence = { version = "0.29.0", optional = true }
crc32fast = "1.4.0"
itertools = "0.13.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we still using this?

Copy link
Contributor Author

@elramen elramen May 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, crc32 is used for hashing set values and itertools is used for sorting and joining the metric tags! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to #659, we should make both these dependencies optional.
Also instead of manually sorting the metric tags, how about you switch to a BTreeMap which is sorted by definition?

@elramen elramen merged commit 73d04ae into master May 27, 2024
12 checks passed
@elramen elramen deleted the metrics-normalization branch May 27, 2024 15:10
@elramen elramen self-assigned this Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants