Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[processor/spanmetrics] Prune histograms #27083

Merged

Conversation

nijave
Copy link
Contributor

@nijave nijave commented Sep 23, 2023

Prune histograms when the dimension cache evictions are removed

Description:
Prunes histograms when the dimension cache is pruned. This prevents metric series from growing indefinitely

Link to tracking Issue:
#27080

Testing:
I modified the the existing test to check histograms length instead of dimensions cache length. This required simulating ticks to hit the exportMetrics function

Documentation:

@nijave nijave requested review from a team and dashpole September 23, 2023 01:07
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Sep 23, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: nijave / name: Nick Venenga (926f2a3)
  • ✅ login: MovieStoreGuy / name: Sean Marciniak (3745bed)

Copy link
Contributor

@albertteoh albertteoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nijave!

Please be advised though that the spanmetricsprocessor is deprecated.

I believe the spanmetricsconnector does prune histograms, however, please let me know if I've misunderstood something.

@mfilipe
Copy link

mfilipe commented Sep 25, 2023

Thanks @nijave!

Please be advised though that the spanmetricsprocessor is deprecated.

I believe the spanmetricsconnector does prune histograms, however, please let me know if I've misunderstood something.

@albertteoh I don't think it is being pruned.

@MovieStoreGuy
Copy link
Contributor

Ahh, if this is related to memory management and concerns, then performing a delete on a map doesn't release the underlying storage until map is scheduled for GC.

The easiest way to address this is use pointer values for the value type since it keeps the memory allocated small.

lastly, this is a deprecated component you're trying to update. Could you please apply the fix to the connector version instead?

@nijave
Copy link
Contributor Author

nijave commented Sep 30, 2023

Ahh, if this is related to memory management and concerns, then performing a delete on a map doesn't release the underlying storage until map is scheduled for GC.

The easiest way to address this is use pointer values for the value type since it keeps the memory allocated small.

lastly, this is a deprecated component you're trying to update. Could you please apply the fix to the connector version instead?

It's less about memory management and more about controlling Prometheus active series which get emitted to long term storage. We're seeing about an extra ~1m active series pile up which is $8k/mon on Grafana Cloud. Actually, the original workaround was to set Grafana Agent to have a 2GiB memory limit on Kubernetes and let it get OOMKill'd before it could pile up too many metrics.

Usually what's happening for us, some errors come in and generate new series. They drop off but the series stick around until the agents are restarted which can be weeks or even months. We also have batch jobs and other periodic workloads that emit trace data occasionally but not regularly and these pile up as well.

We're using Grafana Agent which pulls this in as a library and we need to rewrite/test our config using their new config pattern if we want to use connector so, for us, we need to update our deprecated Grafana Agent config first.

If I have time, I'll take a look at the connector and see if it has the same problem. I'm actually switching jobs so after Oct 6th, I won't have a real environment available to test changes anymore.

@nijave
Copy link
Contributor Author

nijave commented Sep 30, 2023

Ahh, if this is related to memory management and concerns, then performing a delete on a map doesn't release the underlying storage until map is scheduled for GC.

That is interesting--I'm not very well versed in go (especially the GC quirks) so I'll have to do some reading up but I suspect that's not an issue for the use case I described above.

@MovieStoreGuy
Copy link
Contributor

@nijave I've rerun the tests, could I ask you to address any failures and provide a changelog?

Considering this is related to the Grafana, I'll cc @jpkrohling to see if they can also address this upstream as well :)

@nijave nijave force-pushed the nv-prune-spanmetrics-histograms branch 5 times, most recently from 1b3417a to 5633c3f Compare October 3, 2023 22:23
@nijave nijave changed the title Prune spanmetrics histograms [processor/spanmetrics] Prune histograms Oct 3, 2023
@nijave nijave force-pushed the nv-prune-spanmetrics-histograms branch from 5633c3f to 5946572 Compare October 3, 2023 22:27
@nijave
Copy link
Contributor Author

nijave commented Oct 3, 2023

@nijave I've rerun the tests, could I ask you to address any failures and provide a changelog?

Considering this is related to the Grafana, I'll cc @jpkrohling to see if they can also address this upstream as well :)

Ok, I think I addressed all the issues.

I have grafana/agent#5271 open on Grafana Agent (I think it's linked in the corresponding issue in this repo but not on this PR)

Prune histograms when the dimension cache evictions are removed
@nijave nijave force-pushed the nv-prune-spanmetrics-histograms branch from 5946572 to 926f2a3 Compare October 3, 2023 22:47
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: bug_fix
Copy link
Contributor Author

@nijave nijave Oct 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure about this one. Technically this will produce less metric series than it previously did (the output of this component is changing) but my understanding was it was always intended that metric series (histograms) get pruned at the same time as dimensions cache map.

i.e. it was a bug these histograms weren't being pruned

@MovieStoreGuy MovieStoreGuy merged commit 6aac9bd into open-telemetry:main Oct 4, 2023
88 checks passed
@github-actions github-actions bot added this to the next release milestone Oct 4, 2023
@nijave nijave deleted the nv-prune-spanmetrics-histograms branch October 4, 2023 12:05
nijave added a commit to nijave/opentelemetry-collector-contrib that referenced this pull request Oct 4, 2023
Prune histograms when the dimension cache evictions are removed

**Description:**
Prunes histograms when the dimension cache is pruned. This prevents
metric series from growing indefinitely

**Link to tracking Issue:**
 open-telemetry#27080

**Testing:**
I modified the the existing test to check `histograms` length instead of
dimensions cache length. This required simulating ticks to hit the
exportMetrics function

**Documentation:** <Describe the documentation added.>

Co-authored-by: Sean Marciniak <30928402+MovieStoreGuy@users.noreply.github.com>
@nijave
Copy link
Contributor Author

nijave commented Oct 6, 2023

Here's some numbers after running in our dev environment for a day with this change grafana/agent#5271 (comment)

jmsnll pushed a commit to jmsnll/opentelemetry-collector-contrib that referenced this pull request Nov 12, 2023
Prune histograms when the dimension cache evictions are removed

**Description:**
Prunes histograms when the dimension cache is pruned. This prevents
metric series from growing indefinitely

**Link to tracking Issue:**
 open-telemetry#27080

**Testing:**
I modified the the existing test to check `histograms` length instead of
dimensions cache length. This required simulating ticks to hit the
exportMetrics function

**Documentation:** <Describe the documentation added.>

Co-authored-by: Sean Marciniak <30928402+MovieStoreGuy@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
processor/spanmetrics Span Metrics processor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants