Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query-Frontend Prometheus Codec: Metrics Query MinTime and MaxTime, Apply Fix to Query Component Assignment #7742

Conversation

francoposa
Copy link
Member

@francoposa francoposa commented Mar 27, 2024

What this PR does

Prometheus Codec for Metrics Query types only expose the Start and End, with no understanding of the query itself or how a range vector or offset can change the actual time range queried.

First noticed as part of an issue where the frontend's query scheduler adapter was assigning the expected query component to be ingesters only, based on start and end timestamps, but the query was actually looking back over many hours or days, meaning the expected query component should have been assigned as ingesters and store gateways.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@@ -270,16 +273,17 @@ func (prometheusCodec) decodeRangeQueryRequest(r *http.Request) (MetricsQueryReq
return nil, err
}

result.Query = r.FormValue("query")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any use of formvalue is a potential bug as it consumes the body on a POST.
everything should use util.ParseRequestFormWithoutConsumingBody, then work with the map of values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a linting rule for this?

@francoposa francoposa force-pushed the francoposa/frontend-prometheus-codec-min-max-time-from-query-parsing branch from 3188129 to b3979cd Compare April 3, 2024 20:55
if err != nil {
return nil, err
}
return a.queryComponentQueueDimensionFromTimeParams(tenantIDs, time, time, now), nil

return a.queryComponentQueueDimensionFromTimeParams(tenantIDs, minT, maxT, now), nil
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the actual bugfix - instead of using start/end or time for range and instant queries, we are determining this query component with actual minT and maxT for the query.

@francoposa francoposa changed the title Query-Frontend Prometheus Codec Metrics Query MinTime and MaxTime Query-Frontend Prometheus Codec Metrics Query MinTime and MaxTime, Apply Fix to Query Component Assignment Apr 3, 2024
@francoposa francoposa marked this pull request as ready for review April 3, 2024 23:40
@francoposa francoposa requested a review from a team as a code owner April 3, 2024 23:40
@francoposa francoposa changed the title Query-Frontend Prometheus Codec Metrics Query MinTime and MaxTime, Apply Fix to Query Component Assignment Query-Frontend Prometheus Codec: Metrics Query MinTime and MaxTime, Apply Fix to Query Component Assignment Apr 3, 2024
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall approach looks good to me.

pkg/frontend/querymiddleware/codec.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/codec.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/model_extra.go Outdated Show resolved Hide resolved
@francoposa
Copy link
Member Author

Overall approach looks good to me.

thanks for the feedback!

Will end up needing a rebase or re-do here if I get #7810 through - point is to abandon proto representations so we can enforce relationships between minT, maxT, query, etc.

@francoposa francoposa force-pushed the francoposa/frontend-prometheus-codec-min-max-time-from-query-parsing branch 2 times, most recently from fbfa226 to eec43e1 Compare April 8, 2024 17:50
@francoposa
Copy link
Member Author

@charleskorn updated approach to only parse query once, though it still walks the queryExpr tree to get min and max each time - I think I am fine with that for now.
Unfortunately this created a bunch of boilerplate test changes throughout the frontend since WithQuery can return an error for a bad parse now.

We could actually set minT and maxT on the object and only walk the tree on object creation or the WithStartEnd/WithQuery transforms, but for that we would want to create a constructor which doesn't currently exist and would probably create a bunch more boilerplate changes.

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could actually set minT and maxT on the object and only walk the tree on object creation or the WithStartEnd/WithQuery transforms, but for that we would want to create a constructor which doesn't currently exist and would probably create a bunch more boilerplate changes.

Given we only call GetMinT() and GetMaxT() in one place, and call both at the same time, what if we had a single GetTimeRange() method that returns both? Then we won't pay the cost of walking the tree twice.

@@ -270,16 +273,17 @@ func (prometheusCodec) decodeRangeQueryRequest(r *http.Request) (MetricsQueryReq
return nil, err
}

result.Query = r.FormValue("query")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a linting rule for this?

@francoposa
Copy link
Member Author

Given we only call GetMinT() and GetMaxT() in one place, and call both at the same time, what if we had a single GetTimeRange() method that returns both? Then we won't pay the cost of walking the tree twice.

I wanted to keep GetMinT and GetMaxT separate given the existing style of GetStart and GetEnd.

If we collapsed GetMinT and GetMaxT together I would want to do the same for GetStart and GetEnd.

Alternatively I could just go with the route mentioned of moving to a constructor and only parsing and walking on constructors or transforms. I think this is a bit more of a complete solution.

Either option would be some more boilerplate-y changes - up to you if you want them to come in this PR or a follow-on one

@charleskorn
Copy link
Contributor

I wanted to keep GetMinT and GetMaxT separate given the existing style of GetStart and GetEnd.

If we collapsed GetMinT and GetMaxT together I would want to do the same for GetStart and GetEnd.

Alternatively I could just go with the route mentioned of moving to a constructor and only parsing and walking on constructors or transforms. I think this is a bit more of a complete solution.

I'm OK with GetStart/GetEnd and GetTimeRange having different styles, but either solution (single method or constructor) is fine with me.

Either option would be some more boilerplate-y changes - up to you if you want them to come in this PR or a follow-on one

I think it'd be best to do whichever choice you choose as part of this PR.

@@ -62,47 +62,16 @@ func Test_queryStatsMiddleware_Do(t *testing.T) {
Step: step,
},
},
{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no longer a case as the query will have already been validated and parsed before it is in the request object

@francoposa
Copy link
Member Author

change to use constructor and only parse query on construct or transform is complete for the range requests - need to do instant now, should be shorter since there's a lot fewer components and tests using instant queries

@francoposa
Copy link
Member Author

francoposa commented Apr 12, 2024

it's a doozy with all the test fixtures to update, but it's all ready.

Constructors NewPrometheusRangeQueryRequest and NewPrometheusInstantQueryRequest, as well as each type'
s respective WithStartEnd and WithQuery implementation now calculate minT and maxT and set them as fields.

All type fields are now private and exposed through getters and setters to avoid being able to set query, start, end, step, or lookback delta without the derived minT and maxT being updated as well.

WithStartEnd and WithQuery transforms are tested for consistency.

the only functional things that should have changed are

  1. the original bug to fix, which is determining expected query component in frontend scheduler adapter with minT and maxT, not start and end
  2. query blocker and caching should actually be a bit more stable now because they generate cache keys from the parsed-then-stringified query, rather than the unvalidated string query, as the parsed query's String() method always formats it in the same way, regardless of original query format.
    i. I doubt there were ever that many cases of this in the wild
    ii. it reorders operators, like sum(container_memory_rss) by (namespace) becomes sum by (namespace) (container_memory_rss)
    iii. and removes braces from metrics names when there is no selector - foo{} becomes foo

Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, thanks for adding the constructor.

  1. query blocker and caching should actually be a bit more stable now because they generate cache keys from the parsed-then-stringified query, rather than the unvalidated string query, as the parsed query's String() method always formats it in the same way, regardless of original query format.
    1. I doubt there were ever that many cases of this in the wild
    2. it reorders operators, like sum(container_memory_rss) by (namespace) becomes sum by (namespace) (container_memory_rss)
    3. and removes braces from metrics names when there is no selector - foo{} becomes foo

Might be worth calling out in the changelog entry that this will cause some query result cache churn when first deployed.

Will the pretty-printed version of the expression be logged anywhere? If not, it's going to be difficult to block problematic queries. If it is logged somewhere, we should mention both that the pretty-printed query is used and how to see the pretty-printed query in the docs for query blocking.

@@ -171,7 +171,8 @@ func (s *splitInstantQueryByIntervalMiddleware) Do(ctx context.Context, req Metr
s.metrics.splitQueriesPerQuery.Observe(float64(mapperStats.GetSplitQueries()))

// Send hint with number of embedded queries to the sharding middleware
req = req.WithQuery(instantSplitQuery.String()).WithTotalQueriesHint(int32(mapperStats.GetSplitQueries()))
req, _ = req.WithQuery(instantSplitQuery.String()) // expect no error as instantSplitQuery is already a valid prometheus query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we added a WithExpr method here? Then we wouldn't need to pay the price of turning the existing Expr into a string and then parsing it again

@francoposa
Copy link
Member Author

Will the pretty-printed version of the expression be logged anywhere? If not, it's going to be difficult to block problematic queries. If it is logged somewhere, we should mention both that the pretty-printed query is used and how to see the pretty-printed query in the docs for query blocking.

So I looked at this from a few different angles, and the query is logged when it's blocked, but not when it's not blocked.
So the primary purpose is to enable people to block the correct queries:

  • cost of blocking something you didn't mean to is low
  • cost of not blocking something you need to block is potentially very high

In either case we should enable the user to enter the query block correctly.
I checked if anything existed to do this, and the latest version of promtool does it, behind an experimental flag:

$ promtool --experimental promql format "foo{}"
foo

however this only pretty-prints it. There are two methods for string-formatting a promql expression: String and Pretty.
We only use String in the frontend, and in any case it would probably be a poor experience to put pretty-printed queries with tons of whitespace in the blocker YAML.

thankfully there's nothing magical in the promtool version we can't do ourselves in mimirtool:
https://github.com/prometheus/prometheus/blob/0305490e4e7baf9cb69e4bf5b131238fb38ffa77/cmd/promtool/main.go#L1221-L1229

func formatPromQL(query string) error {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return err
	}

	fmt.Println(expr.Pretty(0))
	return nil
}

I think the best approach here is to add something like this like mimirtool promql format [--pretty] expr, and put it in the query blocking docs that you can use it to see how you format your query blocking entries

@francoposa francoposa force-pushed the francoposa/frontend-prometheus-codec-min-max-time-from-query-parsing branch from b53fab6 to f66a66e Compare April 25, 2024 22:54
@francoposa
Copy link
Member Author

everything I mentioned above is done, and WithExpr is tested alongside WithQuery.

Feedback welcome on changelog wording of course

CHANGELOG.md Outdated Show resolved Hide resolved
## Formatting queries to block

Queries received by Mimir are parsed into PromQL expressions before blocking is applied.
The `pattern` from the blocked queries is compared against Prometheus' string format representation of the parsed query,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Maybe this would be clearer?

Suggested change
The `pattern` from the blocked queries is compared against Prometheus' string format representation of the parsed query,
The `pattern` from the blocked queries is compared against the formatted representation of the parsed query,

If so, I would use the same term below (rather than referring to Prometheus).

docs/sources/mimir/configure/configure-blocked-queries.md Outdated Show resolved Hide resolved

// PromQL Format Query Command
promqlFormatCmd := promqlCmd.Command("format", "Format PromQL query with Prometheus' string formatter; wrap query in quotes for CLI parsing.").Action(c.formatQuery)
promqlFormatCmd.Flag("pretty", "use Prometheus' pretty-print formatter").BoolVar(&c.prettyPrint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Prometheus relevant here? I think we can just say something like "pretty-print expression"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fact that it's using upstream code is relevant for a user technical enough to be doing this. It implies a level of stability and consistency with other tooling like promtool - using the --pretty flag here is identical to the promtool command. The promtool version just doesn't allow non-pretty at this point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that the fact we're using the upstream code is so relevant here or in the docs above - query blocking is a Mimir-only feature, so alignment with how Prometheus formats queries isn't so important, but consistency with how the query-blocking feature formats queries is important.

pkg/mimirtool/commands/promql.go Show resolved Hide resolved
pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
Copy link
Member

@jdbaldry jdbaldry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we may need to reorder the conceptual material to focus on why the user wants to perform this task up front, after that is done, I can review the language specifically.

nit: the arbitrary line breaks are worse for line based diffing in GitHub in the future, consider using semantic line breaks instead.

docs/sources/mimir/configure/configure-blocked-queries.md Outdated Show resolved Hide resolved
docs/sources/mimir/configure/configure-blocked-queries.md Outdated Show resolved Hide resolved
docs/sources/mimir/configure/configure-blocked-queries.md Outdated Show resolved Hide resolved
@francoposa
Copy link
Member Author

docs shell style has been updated.

The line breaks are already semantic, not arbitrary they all break at the latest possible sentence clause break while still keeping line length <= 120 chars.

There are non-semantic breaks in the existing language in the doc but I left them in order not to create further meaningless changes in this diff.

@francoposa
Copy link
Member Author

tests were split, Changelog feedback added.

Resolved everything but the ideas about not referring to using the Prometheus formatter.
As mentioned above I personally feel that using explicitly using the upstream is an advantage from the perspective of an end user technical enough to use query blocking.
It implies that it should be a pretty stable and consistent formatting, not Mimir diverging from what upstream does and then trying to achieve consistency later. If the (currently-experimental) new promtool promql format supported non-pretty string formatting, we probably would direct users there as an alternative to mimirtool.

pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/codec_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@francoposa francoposa merged commit f43c62e into main May 1, 2024
29 checks passed
@francoposa francoposa deleted the francoposa/frontend-prometheus-codec-min-max-time-from-query-parsing branch May 1, 2024 03:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants