Optionally wait for the query-frontend to start up before rejecting requests #6621

charleskorn · 2023-11-13T00:43:50Z

What this PR does

This PR modifies the behaviour of query-frontends to optionally wait (up to a configured timeout) for the frontend to be ready if a request is received while it is still starting up.

Under normal circumstances, query-frontends shouldn't receive requests while starting up, because their readiness probes do not succeed during this time, so they are not registered as endpoints of the query-frontend Kubernetes service.

However, if a query-frontend restarts (e.g. due to OOMing), nodes in the Kubernetes cluster will not immediately observe the restart and the new unhealthy state, so callers of the query-frontend can continue sending traffic to it while it is starting.

This can result in users receiving an error like `frontend not running: New` in response to queries.

Based on observed behaviour in production clusters at Grafana Labs, most frontends start in less than 2s, so setting the timeout to 2s would be a reasonable choice.

Which issue(s) this PR fixes or relates to

(none)

Checklist

Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

docs/sources/mimir/configure/about-versioning.md

docs/sources/mimir/references/configuration-parameters/index.md

pkg/frontend/querymiddleware/retry.go

jhesketh

lgtm

cmd/mimir/help-all.txt.tmpl

dimitarvdimitrov

The change looks OK, but it opens the possibility that any middlewares called before the retry middleware did their work while the service was not yet started, so their view of the "world" may be incomplete or wrong. If we retry in the retry middleware, we don't run those other middlewares again. For example, see below.

I think this has the risk of introducing subtle bugs.

What do you think about introducing another middleware as a top-level middleware which checks the state of the whole mimir process and does waiting and rejections based on that?

pkg/util/errors.go

charleskorn · 2023-11-14T00:21:36Z

> What do you think about introducing another middleware as a top-level middleware which checks the state of the whole mimir process and does waiting and rejections based on that?

Makes sense to me, I'll rework this now.

charleskorn · 2023-11-14T03:09:10Z

> What do you think about introducing another middleware as a top-level middleware which checks the state of the whole mimir process and does waiting and rejections based on that?

Turns out this is a bit more involved than I expected due to the separation of the QueryFrontendTripperware and QueryFrontend services, but I think I've got a good solution now in aafb70f. Let me know what you think @dimitarvdimitrov.

pkg/frontend/querymiddleware/roundtrip.go

pkg/frontend/querymiddleware/running.go

dimitarvdimitrov

The changes look mostly good. I'm not sure about that function arg. It also introduces an implicit circular dependency between the QueryFrontend and the QueryFrontendTripperware modules. Sorry for prolonging this review. I will be away tomorrow, so feel free to merge this before next week.

pkg/frontend/querymiddleware/roundtrip.go

pkg/mimir/modules.go

jhesketh

lgtm

jhesketh · 2023-11-17T02:45:31Z

pkg/frontend/querymiddleware/running.go

+}
+
+// This method is not on frontendRunningRoundTripper to make it easier to test this logic.
+func awaitQueryFrontendServiceRunning(ctx context.Context, service services.Service, timeout time.Duration, log log.Logger) error {


(nit) Couldn't this be generalised, since it can be applied to any services.Service, not just the query-frontend?

It could, but at the moment we don't have a need for it to be used elsewhere.

dimitarvdimitrov

LGTM, thanks for addressing all my comments!

charleskorn added 2 commits November 13, 2023 11:35

Backoff and retry requests received while the query-frontend is start…

4b6549e

…ing up

Add changelog entry.

a096fed

charleskorn marked this pull request as ready for review November 13, 2023 01:16

charleskorn requested review from a team as code owners November 13, 2023 01:16

jhesketh requested changes Nov 13, 2023

docs/sources/mimir/configure/about-versioning.md (outdated)

docs/sources/mimir/references/configuration-parameters/index.md (outdated)

pkg/frontend/querymiddleware/retry.go (outdated)

Address PR feedback: disable backoff by default, rename flag, make `i…

dc424ce

…sTerminalError` resilient to other states added in the future

charleskorn requested a review from jhesketh November 13, 2023 02:26

jhesketh approved these changes Nov 13, 2023

cmd/mimir/help-all.txt.tmpl (outdated)

dimitarvdimitrov reviewed Nov 13, 2023

pkg/util/errors.go (outdated)

charleskorn added 2 commits November 14, 2023 11:27

Make formatting of newQueryTripperware consistent.

e1dd3fb

Check frontend is running earlier, to avoid unnecessary work and avoi…

aafb70f

…d other middleware running while frontend is in an inconsistent state

charleskorn force-pushed the charleskorn/back-off-on-frontend-not-ready branch from 8dcf099 to aafb70f Compare November 14, 2023 03:07

charleskorn requested a review from dimitarvdimitrov November 14, 2023 03:09

dimitarvdimitrov reviewed Nov 14, 2023

pkg/frontend/querymiddleware/roundtrip.go (outdated)

pkg/frontend/querymiddleware/running.go (outdated)

pkg/frontend/querymiddleware/running.go (outdated)

charleskorn changed the title ~~Backoff and retry requests received while the query-frontend is starting up~~ Optionally wait for the query-frontend to start up before rejecting requests Nov 15, 2023

charleskorn added 3 commits November 15, 2023 11:49

Refactor to use service.AwaitRunning()

e22f709

Update changelog entry to reflect new behaviour and config option name

cca1687

Address PR feedback: ie. -> i.e.

45ae983

charleskorn force-pushed the charleskorn/back-off-on-frontend-not-ready branch from 3c9e13d to 45ae983 Compare November 15, 2023 00:49

charleskorn and others added 4 commits November 15, 2023 12:28

Address PR feedback: pass interface, not function

7df3610

Simplify code.

7d25ba3

Apply same logic to cardinality and series endpoints as well.

4c482a9

Merge branch 'main' into charleskorn/back-off-on-frontend-not-ready

8674e05

charleskorn requested a review from dimitarvdimitrov November 16, 2023 00:05

dimitarvdimitrov approved these changes Nov 16, 2023

pkg/frontend/querymiddleware/roundtrip.go (outdated)

pkg/mimir/modules.go (outdated)

charleskorn added 2 commits November 17, 2023 10:55

Address PR feedback: remove implicit circular dependency.

f3fae27

Update outdated comment.

3e851e9

Make method and trace span names consistent

5699d7a

charleskorn requested review from jhesketh and dimitarvdimitrov November 17, 2023 00:00

jhesketh approved these changes Nov 17, 2023

dimitarvdimitrov approved these changes Nov 20, 2023

charleskorn merged commit a27952a into main Nov 20, 2023
28 checks passed

charleskorn deleted the charleskorn/back-off-on-frontend-not-ready branch November 20, 2023 23:05

kjelljoakim mentioned this pull request Feb 21, 2024

Panic in query-frontend roundtripper when frontend.downstream_url is set #7436

Closed
