ref(metrics): Add normalization and update set metrics hashing #658

elramen · 2024-05-23T16:54:10Z

Add metrics normalization in accordance with the metrics developer docs.
Add metric name, unit, and tag truncation to adhere to the metrics user docs.
Add to_envelope for a single Metric instance to facilitate sending metrics from sentry-cli.
Change the hash function to crc32 as described here; this ensures compatibility between different SDKs using the same metric.
Use clone_from instead of clone and clone_into instead of to_owned in a couple of places as suggested by clippy.
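
As an illustration of the crc32 point: a string set value has to be reduced to an integer, and every SDK has to reduce it the same way for the values to line up. The sketch below is a std-only, bitwise CRC-32 (IEEE polynomial), used here as a dependency-free stand-in for the `crc32fast` crate that the PR adds. `hash_set_value` is a hypothetical helper name, not the SDK's API.

```rust
/// Bitwise CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
/// Slow but dependency-free; production code would use a table-driven
/// or SIMD implementation instead.
fn crc32(bytes: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in bytes {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // `mask` is all ones when the lowest bit is set, otherwise zero.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Hypothetical helper: reduce a string set value to a stable u32.
fn hash_set_value(value: &str) -> u32 {
    crc32(value.as_bytes())
}
```

Because the algorithm is fully specified, the same input bytes produce the same integer regardless of language or platform, which is the property the description asks for.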

Swatinem

The code looks correct and has good test coverage, so no problem there.

My main concerns are:

performance, quite a bit so
dependencies

At the very least, you should follow the performance best practices of the regex crate here: https://docs.rs/regex/latest/regex/#avoid-re-compiling-regexes-especially-in-a-loop

I would consider that a blocker, as you are compiling the same regex over and over again in every single call.

Both for performance and dependencies: Do we really care that tags are limited to N graphemes? Why not Unicode chars or even bytes? Since in a bunch of cases (keys mainly?) we are filtering for ASCII chars anyway, iterating over graphemes is absolute overkill.

Otherwise, there is a bunch of String allocation happening all over the place which might be avoidable. Those are just a drop in the bucket compared to the regex topic mentioned above, so it might not be worth micro-optimizing the last drop of performance out of this.

Speaking of micro-optimization, this very much reminds me of the work I did recently in rust-lang/rust#121150 which I see is about to land today 🎉
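
The compile-once advice for regexes can be sketched with std only. Here a 256-entry lookup table stands in for the compiled regex, and `std::sync::OnceLock` ensures it is built on first use and then shared. With the regex crate, the `Regex::new(...)` call would take the place of the hypothetical `build_allowed_table`. The allowed character set and all names are assumptions for illustration, not the PR's actual rules.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how often the expensive setup runs (for demonstration only).
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for compiling a regex: a table of allowed bytes.
fn build_allowed_table() -> [bool; 256] {
    BUILDS.fetch_add(1, Ordering::SeqCst);
    let mut table = [false; 256];
    for b in 0..=255u8 {
        table[b as usize] = b.is_ascii_alphanumeric() || matches!(b, b'_' | b'-' | b'.');
    }
    table
}

/// Returns the shared table, building it on the first call only.
fn allowed_table() -> &'static [bool; 256] {
    static TABLE: OnceLock<[bool; 256]> = OnceLock::new();
    TABLE.get_or_init(build_allowed_table)
}

/// Replaces every disallowed byte with '_'. Each byte of a multi-byte
/// character is disallowed, so such a character becomes several '_'.
fn sanitize_key(key: &str) -> String {
    let table = allowed_table();
    key.bytes()
        .map(|b| if table[b as usize] { b as char } else { '_' })
        .collect()
}
```

However many times `sanitize_key` is called, the setup cost is paid once, which is the behavior the regex documentation recommends for patterns used on a hot path.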

elramen · 2024-05-24T12:12:07Z

Thanks @Swatinem! I will optimize the regex usage. The graphemes can be removed for tag keys since, as you mentioned, we already filter out multi-byte characters. But for tag values, which allow the full UTF-8 character range, the code will panic if we truncate the tag value in the middle of a multi-byte character. Do you have any suggestions for how to truncate the tag value safely without graphemes? Maybe we can catch the panic and then reduce the truncation by 1 byte until it works?

Swatinem · 2024-05-24T12:21:03Z

"catching panics" is not really a thing because people can (and often do) use panic = "abort", not to mention that it also has perf overhead.

floor/ceil_char_boundary exists in theory but is nightly only:
https://doc.rust-lang.org/std/primitive.str.html#method.floor_char_boundary
The docs also contain a nice example related to graphemes.

Depending on what our goal here is (bytes, chars or graphemes), you might as well just copy over the underlying implementation from std, or just iterate over chars() which I believe might be the simplest solution here.
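
Both options mentioned above can be sketched on stable Rust with std only. `floor_boundary` mirrors the idea behind `floor_char_boundary` using `str::is_char_boundary`, and `truncate_chars` takes the "iterate over chars()" route. The function names are hypothetical, and which limit (bytes or chars) the SDK should enforce is left open here.

```rust
/// Largest index <= `max_bytes` that lies on a UTF-8 char boundary of `s`.
fn floor_boundary(s: &str, max_bytes: usize) -> usize {
    if max_bytes >= s.len() {
        return s.len();
    }
    let mut idx = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(idx) {
        idx -= 1;
    }
    idx
}

/// Truncates to at most `max_bytes` bytes without splitting a character.
fn truncate_bytes(s: &str, max_bytes: usize) -> &str {
    &s[..floor_boundary(s, max_bytes)]
}

/// Truncates to at most `max_chars` Unicode scalar values.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // `idx` is the byte offset where the first excluded char starts.
        Some((idx, _)) => &s[..idx],
        None => s,
    }
}
```

Neither function can panic on a multi-byte character, and both return a borrowed slice, so no new `String` is allocated for the truncation itself.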

elramen · 2024-05-24T12:24:13Z

Great, thanks!

sl0thentr0py

lgtm, but please wait for Arpad's approval before merging and releasing

loewenheim

Looks very good. I have one criticism, and it's admittedly nitpicky: I don't think the From impls on NormalizedName/Tags/Unit are appropriate. Specifically, they are neither lossless nor value-preserving, as per https://doc.rust-lang.org/std/convert/trait.From.html#when-to-implement-from. I think free functions normalize_name/tags/unit exported from the normalization module would be better here.
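
A minimal sketch of the free-function alternative, assuming a hypothetical rule that keeps ASCII alphanumerics plus `_`, `-` and `.` and replaces everything else with `_` (the real rules live in the metrics docs, not here). Because different inputs can map to the same output, the conversion is lossy, which is why a named function states the intent better than a `From` impl.

```rust
use std::borrow::Cow;

/// Hypothetical rule for which characters a metric name may keep.
fn is_allowed(c: char) -> bool {
    c.is_ascii_alphanumeric() || matches!(c, '_' | '-' | '.')
}

/// Lossy by design, so it is a named function rather than `impl From<&str>`.
/// Borrows the input when nothing has to change, avoiding an allocation.
fn normalize_name(name: &str) -> Cow<'_, str> {
    if name.chars().all(is_allowed) {
        Cow::Borrowed(name)
    } else {
        Cow::Owned(
            name.chars()
                .map(|c| if is_allowed(c) { c } else { '_' })
                .collect(),
        )
    }
}
```

Under this assumed rule, `"cpu load"` and `"cpu_load"` both normalize to `"cpu_load"`, so the original cannot be recovered from the result.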

elramen · 2024-05-27T12:55:28Z

@Swatinem Fixed the regex optimization, removed graphemes, and reduced the number of new string allocations 👍

elramen · 2024-05-27T13:13:37Z

@loewenheim On it!

codecov · 2024-05-27T14:01:55Z

Codecov Report

Attention: Patch coverage is 89.57219%, with 39 lines in your changes missing coverage. Please review.

Project coverage is 73.56%. Comparing base (6b83faa) to head (26e4893).
Report is 5 commits behind head on master.

❗ Current head 26e4893 differs from pull request most recent head f3e7a57

Please upload reports for the commit f3e7a57 to get more accurate results.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #658      +/-   ##
==========================================
+ Coverage   73.09%   73.56%   +0.46%     
==========================================
  Files          62       66       +4     
  Lines        7448     7727     +279     
==========================================
+ Hits         5444     5684     +240     
- Misses       2004     2043      +39

elramen · 2024-05-27T14:02:29Z

@loewenheim I removed the name and unit structs and just use a function instead now. For the tags, I export a function but keep the struct. Does this look ok?

Swatinem · 2024-05-27T14:52:01Z

sentry-core/Cargo.toml

@@ -31,9 +31,12 @@ UNSTABLE_cadence = ["dep:cadence", "UNSTABLE_metrics"]

 [dependencies]
 cadence = { version = "0.29.0", optional = true }
+crc32fast = "1.4.0"
+itertools = "0.13.0"


are we still using this?

elramen · May 27, 2024

Yes, crc32 is used for hashing set values and itertools is used for sorting and joining the metric tags! :)

Swatinem · May 28, 2024

Similar to #659, we should make both of these dependencies optional.
Also, instead of manually sorting the metric tags, how about switching to a BTreeMap, which is sorted by definition?
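
The BTreeMap suggestion can be sketched with std only: iteration over a `BTreeMap` is already in key order, so no explicit sort step or itertools helper is needed to produce a stable tag string. The `key:value,key:value` layout and the name `join_tags` are assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// Joins tags as `key:value` pairs separated by commas, in key order.
/// `BTreeMap::iter` yields entries sorted by key, so no sort is needed.
fn join_tags(tags: &BTreeMap<String, String>) -> String {
    tags.iter()
        .map(|(k, v)| format!("{k}:{v}"))
        .collect::<Vec<_>>()
        .join(",")
}
```

Inserting `region` before `env` still yields `env:prod,region:eu`, because the ordering comes from the map rather than from insertion order.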

elramen requested review from Swatinem, sl0thentr0py and stephanie-anderson May 23, 2024 16:54

ref(metrics): Add normalization and update set metrics hashing

801d7ca

elramen force-pushed the metrics-normalization branch from 5f41028 to 801d7ca Compare May 24, 2024 10:49

ref(core): Use clone_from instead of clone as suggested by clippy

f15f559

Swatinem reviewed May 24, 2024

sentry-core/src/metrics/mod.rs (resolved)

sentry-types/src/protocol/envelope.rs (outdated, resolved)

sentry-core/src/metrics/mod.rs (outdated, resolved)

Elias Ram added 2 commits May 24, 2024 13:18

ref(tracing): Use clone_into instead of to_owned as suggested by clippy

92396ea

fixed to_envelope value

b383fb6

enable manually added timestamp in to_envelope

493d0ff

sl0thentr0py approved these changes May 27, 2024

elramen force-pushed the metrics-normalization branch 2 times, most recently from c08c0b4 to c8c0aeb Compare May 27, 2024 12:22

loewenheim reviewed May 27, 2024

elramen force-pushed the metrics-normalization branch 2 times, most recently from 90d0182 to 77f1a42 Compare May 27, 2024 12:39

Optimize regex and string allocation

835f025

elramen force-pushed the metrics-normalization branch from 77f1a42 to 835f025 Compare May 27, 2024 12:49

elramen requested review from Swatinem and loewenheim May 27, 2024 12:58

elramen force-pushed the metrics-normalization branch from cab00f7 to 26e4893 Compare May 27, 2024 13:59

elramen force-pushed the metrics-normalization branch from 26e4893 to 75179bf Compare May 27, 2024 14:07

loewenheim approved these changes May 27, 2024

removed From trait

f3e7a57

elramen force-pushed the metrics-normalization branch from 75179bf to f3e7a57 Compare May 27, 2024 14:25

Swatinem approved these changes May 27, 2024

elramen merged commit 73d04ae into master May 27, 2024
12 checks passed

elramen deleted the metrics-normalization branch May 27, 2024 15:10

dbanty mentioned this pull request May 27, 2024

Make regex an optional dependency #659

Merged

elramen self-assigned this Jun 5, 2024
