Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mimir query engine: add memory consumption per query limit #8230

Merged
merged 26 commits into from
Jun 3, 2024

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented May 31, 2024

What this PR does

This PR implements a per-query estimated memory consumption limit for the Mimir query engine.

The estimate is based on the primary contributors to memory consumption: samples (eg. promql.FPoints) and running totals (eg. the slices of float64s used by sum() aggregations).

The estimate ignores other contributions to a query's memory consumption like chunks and series labels. These could be added in the future if need be.

The limit is enforced as slices are created during the query, and is based on the capacity of the slice created, not the size requested. These are not necessarily the same: we use bucketed pools for each of these slice types, and the pool will allocate a slice of capacity equal to the bucket that will hold the requested size, which will always be greater or equal to the requested size. This means the limit more closely tracks the actual memory utilisation of the query, but may be slightly higher than otherwise expected.

The estimate is generally accurate, except for:

  • For native histograms, we assume a fixed size per histogram sample (see nativeHistogramSampleSizeFactor in limiting_pool.go), as tracking the true size of each native histogram would be very expensive. A future improvement would be to make nativeHistogramSampleSizeFactor configurable, but I think this is fine for now.
  • The size of slice headers is ignored: the estimate only considers the memory used by slice elements.

Enforcing the limit adds up to 1% latency overhead to some benchmarks, but this seems worthwhile.

This change also required some shuffling of types between packages to help ensure that the underlying pools are not accessed directly and all allocations go through the limit-enforcing methods. In the interests of keeping this PR as small as possible, I haven't done all the refactoring I'd like to do and will do this in a future PR. In particular, I'd like to move the Operator interface and RingBuffer type to the types package, and rename the operator package to operators.

I'd also like to use the PeakEstimatedMemoryConsumptionBytes in a metric and log it on query traces, but this too will come in a follow-up PR.

Which issue(s) this PR fixes or relates to

(none)

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@charleskorn charleskorn marked this pull request as ready for review June 2, 2024 23:29
@charleskorn charleskorn requested review from jdbaldry and a team as code owners June 2, 2024 23:29
Copy link
Contributor

@jhesketh jhesketh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome stuff, looks really good to me. I agree with the planned work (specifically exposing the metric), but makes sense to split it up like this.

@charleskorn charleskorn enabled auto-merge (squash) June 3, 2024 04:48
@charleskorn charleskorn merged commit 45a683e into main Jun 3, 2024
29 checks passed
@charleskorn charleskorn deleted the charleskorn/max-samples-limit branch June 3, 2024 05:00
charleskorn added a commit that referenced this pull request Jun 4, 2024
* Move interface definitions to `types` package

* Move operators to `operators` package

* Add changelog entry
narqo pushed a commit to narqo/grafana-mimir that referenced this pull request Jun 6, 2024
)

* Make formatting consistent

* Initial version of `LimitingPool`

* Move to `operator` package

* Move `FPoint` and `HPoint` pools next to `LimitingPool`

* Add methods for slices of `HPoint` to `LimitingPool`

* Use `LimitingPool` everywhere

* Move pool to its own package and introduce interface

* Move pool interface to `types` package

* Add documentation for `err-mimir-max-in-memory-samples-per-query`

* Add limit CLI flag and config option

* Add (failing) tests

* Fix linting warnings

* Add another test case

* Add more slice types to `LimitedPool`.

* Rework limit to use estimated memory consumption, rather than a number of samples

* Ensure float and bool slices are cleared.

* Update tests to use bytes rather than samples limit

* Add limit to list of experimental features

* Add changelog entry

* Fix linting warning

* Fix description of error

* Remove unnecessary interface and early enforcement of limit

* Fix flag name

* Remove unnecessary interface

* Remove unused methods

* Address PR feedback
narqo pushed a commit to narqo/grafana-mimir that referenced this pull request Jun 6, 2024
…afana#8247)

* Move interface definitions to `types` package

* Move operators to `operators` package

* Add changelog entry
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants