Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optionally wait for the query-frontend to start up before rejecting requests #6621

Merged
merged 15 commits into from
Nov 20, 2023

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented Nov 13, 2023

What this PR does

This PR modifies the behaviour of query-frontends to optionally wait (up to a configured timeout) for the frontend to be ready if a request is received while it is still starting up.

Under normal circumstances, query-frontends shouldn't receive requests while starting up, because their readiness probe will not succeed during this time and so won't be registered in the query-frontend Kubernetes service.

However, if a query-frontend restarts (eg. due to OOMing), nodes in the Kubernetes cluster will not immediately observe the restart and new unhealthy state, and so callers of the query-frontend can continue sending traffic to them while they're starting.

This can result in users receiving an error like frontend not running: New in response to queries.

Based on observed behaviour in production clusters at Grafana Labs, most frontends start in less than 2s, so setting the timeout to 2s would be a reasonable choice.

Which issue(s) this PR fixes or relates to

(none)

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@charleskorn charleskorn marked this pull request as ready for review November 13, 2023 01:16
@charleskorn charleskorn requested review from a team as code owners November 13, 2023 01:16
docs/sources/mimir/configure/about-versioning.md Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/retry.go Outdated Show resolved Hide resolved
…sTerminalError` resilient to other states added in the future
Copy link
Contributor

@jhesketh jhesketh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

cmd/mimir/help-all.txt.tmpl Outdated Show resolved Hide resolved
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the change looks ok, but this opens the possibility that any middlewares that were called before retry did their job on a non-started service state, so their view of the "world" may be incomplete or wrong. If we retry in the retry middleware, then we don't run these other middlewares' function again. For example see below

Screenshot 2023-11-13 at 13 04 46

I think this has the risk of introducing subtle bugs.

What do you think about introducing another middleware as a top-level middleware which checks the state of the whole mimir process and does waiting and rejections based on that?

pkg/util/errors.go Outdated Show resolved Hide resolved
@charleskorn
Copy link
Contributor Author

What do you think about introducing another middleware as a top-level middleware which checks the state of the whole mimir process and does waiting and rejections based on that?

Makes sense to me, I'll rework this now.

@charleskorn charleskorn force-pushed the charleskorn/back-off-on-frontend-not-ready branch from 8dcf099 to aafb70f Compare November 14, 2023 03:07
@charleskorn
Copy link
Contributor Author

What do you think about introducing another middleware as a top-level middleware which checks the state of the whole mimir process and does waiting and rejections based on that?

Turns out this is a bit more involved than I expected due to the separation of the QueryFrontendTripperware and QueryFrontend services, but I think I've got a good solution now in aafb70f. Let me know what you think @dimitarvdimitrov.

pkg/frontend/querymiddleware/roundtrip.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/running.go Outdated Show resolved Hide resolved
pkg/frontend/querymiddleware/running.go Outdated Show resolved Hide resolved
@charleskorn charleskorn changed the title Backoff and retry requests received while the query-frontend is starting up Optionally wait for the query-frontend to start up before rejecting requests Nov 15, 2023
@charleskorn charleskorn force-pushed the charleskorn/back-off-on-frontend-not-ready branch from 3c9e13d to 45ae983 Compare November 15, 2023 00:49
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look mostly good. I'm not sure about that function arg. It is also introducing an implicit circular dependency between the QueryFrontend and the QueryFrontendTripperware modules. sorry for prolonging this review. I will be away tomorrow, so feel free to merge this before next week

pkg/frontend/querymiddleware/roundtrip.go Outdated Show resolved Hide resolved
pkg/mimir/modules.go Outdated Show resolved Hide resolved
Copy link
Contributor

@jhesketh jhesketh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

}

// This method is not on frontendRunningRoundTripper to make it easier to test this logic.
func awaitQueryFrontendServiceRunning(ctx context.Context, service services.Service, timeout time.Duration, log log.Logger) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) couldn't this be generalised as it can be applied to any services.Service, not just query frontend?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could, but at the moment we don't have a need for it to be used elsewhere.

Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing all my comments!

@charleskorn charleskorn merged commit a27952a into main Nov 20, 2023
28 checks passed
@charleskorn charleskorn deleted the charleskorn/back-off-on-frontend-not-ready branch November 20, 2023 23:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants