Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming PromQL engine: add support for range vector selectors as the top-level expression #8023

Merged
merged 22 commits into from
May 9, 2024

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented May 2, 2024

What this PR does

This PR adds support for range vector selectors (eg. my_metric[5m]) as the top-level expression when using the streaming PromQL engine.

As part of this PR, I've split the existing RangeVectorSelectorWithTransformation type that performed a rate() calculation over a range vector selector into two separate types: RangeVectorSelector and RangeVectorFunction (which still only supports the rate() function). Despite the addition of an extra level of indirection through an interface, this did not dramatically alter the benchmark results for queries that use rate().

RangeVectorSelector is the first operator in the streaming engine to return a range vector (rather than an instant vector), so I'm keen to hear any feedback or opinions on the overall design and RangeVectorOperator interface.

Benchmark results:

goos: darwin
goarch: arm64
pkg: github.com/grafana/mimir/pkg/streamingpromql/benchmarks
                                │      standard      │             streaming             │
                                │       sec/op       │   sec/op     vs base              │
Query/a_1[1m],_instant_query             136.6µ ± 7%   131.0µ ± 4%       ~ (p=0.093 n=6)
Query/a_100[1m],_instant_query           551.5µ ± 6%   563.1µ ± 4%       ~ (p=0.310 n=6)
Query/a_2000[1m],_instant_query          6.253m ± 2%   6.805m ± 1%  +8.82% (p=0.002 n=6)
geomean                                  778.0µ        794.7µ       +2.15%

                                │      standard      │              streaming              │
                                │        B/op        │     B/op      vs base               │
Query/a_1[1m],_instant_query            21.08Ki ± 0%   17.26Ki ± 0%  -18.08% (p=0.002 n=6)
Query/a_100[1m],_instant_query          149.3Ki ± 0%   122.6Ki ± 0%  -17.89% (p=0.002 n=6)
Query/a_2000[1m],_instant_query         2.573Mi ± 0%   2.455Mi ± 1%   -4.59% (p=0.002 n=6)
geomean                                 202.4Ki        174.6Ki       -13.74%

                                │      standard      │             streaming              │
                                │     allocs/op      │  allocs/op   vs base               │
Query/a_1[1m],_instant_query              373.0 ± 0%    304.0 ± 0%  -18.50% (p=0.002 n=6)
Query/a_100[1m],_instant_query           1.995k ± 0%   1.912k ± 0%   -4.16% (p=0.002 n=6)
Query/a_2000[1m],_instant_query          32.82k ± 0%   32.72k ± 0%   -0.28% (p=0.002 n=6)
geomean                                  2.901k        2.669k        -7.99%

                                │      standard      │             streaming              │
                                │         B          │      B        vs base              │
Query/a_1[1m],_instant_query            72.63Mi ± 1%   72.12Mi ± 1%       ~ (p=0.310 n=6)
Query/a_100[1m],_instant_query          66.54Mi ± 1%   66.07Mi ± 1%       ~ (p=0.126 n=6)
Query/a_2000[1m],_instant_query         69.13Mi ± 1%   75.38Mi ± 1%  +9.03% (p=0.002 n=6)
geomean                                 69.39Mi        71.08Mi       +2.44%

The slight increase in latency and peak memory utilisation is inline with what we've seen for other test cases that return many series. This can likely be optimised further, but given the absolute difference is small, I'm not too concerned about this and would like to delay any further optimisation to a future PR - I want to keep this PR as simple as possible given the amount of refactoring already required.

Which issue(s) this PR fixes or relates to

(none)

Checklist

  • Tests updated.
  • [n/a] Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [n/a] about-versioning.md updated with experimental features.

@charleskorn charleskorn force-pushed the charleskorn/streaming-promql-range-vector-selector branch from d6e34a3 to bd6c902 Compare May 2, 2024 06:54
@charleskorn charleskorn marked this pull request as ready for review May 2, 2024 07:10
@charleskorn charleskorn requested a review from a team as a code owner May 2, 2024 07:10
@jhesketh jhesketh self-requested a review May 3, 2024 03:55
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice job! 👏 I left few minor comments, but just looking at the code diff I can't see anything wrong.

pkg/streamingpromql/operator/operator.go Show resolved Hide resolved
return m.chunkIterator.Err()
case chunkenc.ValFloat:
t, f := m.chunkIterator.At()
if value.IsStaleNaN(f) || t < rangeStart {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to skip StaleNaN? Aren't they required to implement the correct behaviour for stale series? I'm fine if it's something that will be addressed in future PRs, but here I'm just trying to understand the rationale why they're skipped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added tests for this in a9c6d40. Prometheus' engine ignores staleness markers in range vector selectors (and functions like rate()), so the streaming engine does too.

I'm not sure this is correct for rate()... I'd expect a stale marker to be treated like a counter reset.

However, this seems to be expected behaviour based on this comment.

pkg/streamingpromql/operator/range_vector_selector.go Outdated Show resolved Hide resolved
pkg/streamingpromql/operator/range_vector_function.go Outdated Show resolved Hide resolved
pkg/streamingpromql/engine_test.go Outdated Show resolved Hide resolved
pkg/streamingpromql/operator/operator.go Outdated Show resolved Hide resolved
pkg/streamingpromql/operator/ring_buffer.go Outdated Show resolved Hide resolved
@charleskorn charleskorn force-pushed the charleskorn/streaming-promql-range-vector-selector branch from c56a094 to a9c6d40 Compare May 9, 2024 04:06
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments, and good job spotting the bug!

@charleskorn charleskorn enabled auto-merge (squash) May 9, 2024 09:53
@charleskorn charleskorn merged commit b03ecce into main May 9, 2024
29 checks passed
@charleskorn charleskorn deleted the charleskorn/streaming-promql-range-vector-selector branch May 9, 2024 10:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants