Add the random fixed size exemplar reservoir #4852

Merged: 18 commits, Jan 29, 2024

@MrAlias MrAlias commented Jan 24, 2024

Part of #559

This PR is split from #4455

This adds a Reservoir implementation that will randomly sample a specified number of measurements as exemplars. An explanation of the algorithm is copied from the included comment within the PR:

The following algorithm is "Algorithm L" from Li, Kim-Hung (4 December 1994). "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))". ACM Transactions on Mathematical Software. 20 (4): 481–493 (https://dl.acm.org/doi/10.1145/198429.198435).

A high-level overview of "Algorithm L":

  1. Pre-calculate a random count, greater than the storage size, at which an exemplar will be replaced.
  2. Accept all measurements offered until the configured storage size is reached.
  3. Loop:
    a) When the pre-calculated count is reached, replace a random existing exemplar with the offered measurement.
    b) Calculate the next random count, greater than the current one, at which another exemplar will be replaced.
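
The steps above can be sketched in Go roughly as follows. This is a minimal illustration and not the code added in this PR: the `reservoir` type, its fields, and the `newReservoir`, `uniform`, `advance`, and `offer` names are all hypothetical, measurements are plain `float64` values, and the storage size `k` is assumed to be greater than zero.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// reservoir is a hypothetical fixed-size random sampler used only for
// illustration; it is not the type added in this PR.
type reservoir struct {
	store []float64  // kept measurements, at most k of them
	k     int        // storage size (assumed > 0)
	count int64      // number of measurements offered so far
	next  int64      // offer count at which the next replacement happens
	w     float64    // running weight from Algorithm L
	rng   *rand.Rand // source of randomness
}

func newReservoir(k int, seed int64) *reservoir {
	r := &reservoir{
		store: make([]float64, 0, k),
		k:     k,
		next:  int64(k),
		rng:   rand.New(rand.NewSource(seed)),
	}
	// Step 1: pre-calculate the first replacement count (greater than k).
	r.w = math.Exp(math.Log(r.uniform()) / float64(k))
	r.advance()
	return r
}

// uniform returns a random value in (0, 1], so math.Log never sees zero.
func (r *reservoir) uniform() float64 { return 1 - r.rng.Float64() }

// advance moves next forward by a random gap weighted by w, then updates w.
func (r *reservoir) advance() {
	gap := math.Floor(math.Log(r.uniform()) / math.Log1p(-r.w))
	// Clamp so the conversion to int64 cannot overflow for a tiny w.
	r.next += int64(math.Min(gap, 1<<53)) + 1
	r.w *= math.Exp(math.Log(r.uniform()) / float64(r.k))
}

// offer presents one measurement to the reservoir.
func (r *reservoir) offer(v float64) {
	r.count++
	if len(r.store) < r.k {
		// Step 2: accept everything until the storage is full.
		r.store = append(r.store, v)
		return
	}
	if r.count == r.next {
		r.store[r.rng.Intn(r.k)] = v // Step 3a: replace a random exemplar.
		r.advance()                  // Step 3b: compute the next count.
	}
}

func main() {
	r := newReservoir(5, 42)
	for i := 0; i < 1000; i++ {
		r.offer(float64(i))
	}
	fmt.Println(r.store) // five of the 1000 offered values
}
```

Between replacements, `offer` only increments a counter and compares it to `next`, which is where the algorithm saves work on measurements that are not kept.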

A "replacement" count is computed by considering n independent random numbers, each corresponding to an offered measurement. The smallest k of these numbers (k being the storage capacity) are kept as a subset. The maximum value in this subset, called w, is used to weight another random number generation for the next count that will be considered.

Weighting the next count computation this way yields a uniformly weighted sampling algorithm based on the number of samples the reservoir has seen so far. The sampling "slows down" as more samples are offered, which reduces the bias towards those offered just prior to the end of the collection.

This algorithm is preferred because of its balance of simplicity and performance. It will compute three random numbers (the bulk of computation time) for each item that becomes part of the reservoir, but it does not spend any time on items that do not. In particular, it has an asymptotic runtime of O(k(1 + log(n/k))), where n is the number of measurements offered and k is the reservoir size.

See https://en.wikipedia.org/wiki/Reservoir_sampling for an overview of this and other reservoir sampling algorithms. See https://github.com/MrAlias/reservoir-sampling for a performance comparison of reservoir sampling algorithms.

@MrAlias MrAlias added the Skip Changelog label (PRs that do not require a CHANGELOG.md entry) Jan 24, 2024
@MrAlias MrAlias added this to the v1.23.0 milestone Jan 24, 2024
codecov bot commented Jan 24, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (ce3faf1) 82.3% compared to head (bd024cb) 82.5%.

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #4852     +/-   ##
=======================================
+ Coverage   82.3%   82.5%   +0.1%     
=======================================
  Files        228     230      +2     
  Lines      18569   18754    +185     
=======================================
+ Hits       15300   15480    +180     
- Misses      2981    2985      +4     
- Partials     288     289      +1     

| Files | Coverage Δ |
|---|---|
| sdk/metric/internal/exemplar/storage.go | 100.0% <100.0%> (ø) |
| sdk/metric/internal/exemplar/rand.go | 97.7% <97.7%> (ø) |

... and 1 file with indirect coverage changes

Include a high-level overview of the algorithm implemented and clarify
parameter names to be consistent.
@MrAlias MrAlias merged commit dcfec0c into open-telemetry:main Jan 29, 2024
25 checks passed
@MrAlias MrAlias deleted the add-rand-fixed-size-res branch January 29, 2024 15:26
5 participants