-
Notifications
You must be signed in to change notification settings - Fork 512
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optionally wait for the query-frontend to start up before rejecting requests #6621
Optionally wait for the query-frontend to start up before rejecting requests #6621
Conversation
docs/sources/mimir/references/configuration-parameters/index.md
Outdated
Show resolved
Hide resolved
…sTerminalError` resilient to other states added in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change looks ok, but this opens the possibility that any middlewares that were called before retry
did their job on a non-started service state, so their view of the "world" may be incomplete or wrong. If we retry in the retry middleware, then we don't run these other middlewares' function again. For example see below
I think this has the risk of introducing subtle bugs.
What do you think about introducing another middleware as a top-level middleware which checks the state of the whole mimir process and does waiting and rejections based on that?
Makes sense to me, I'll rework this now. |
…d other middleware running while frontend is in an inconsistent state
8dcf099
to
aafb70f
Compare
Turns out this is a bit more involved than I expected due to the separation of the |
3c9e13d
to
45ae983
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look mostly good. I'm not sure about that function arg. It is also introducing an implicit circular dependency between the QueryFrontend and the QueryFrontendTripperware modules. sorry for prolonging this review. I will be away tomorrow, so feel free to merge this before next week
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
} | ||
|
||
// This method is not on frontendRunningRoundTripper to make it easier to test this logic. | ||
func awaitQueryFrontendServiceRunning(ctx context.Context, service services.Service, timeout time.Duration, log log.Logger) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) couldn't this be generalised as it can be applied to any services.Service, not just query frontend?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could, but at the moment we don't have a need for it to be used elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for addressing all my comments!
What this PR does
This PR modifies the behaviour of query-frontends to optionally wait (up to a configured timeout) for the frontend to be ready if a request is received while it is still starting up.
Under normal circumstances, query-frontends shouldn't receive requests while starting up, because their readiness probe will not succeed during this time and so won't be registered in the query-frontend Kubernetes service.
However, if a query-frontend restarts (eg. due to OOMing), nodes in the Kubernetes cluster will not immediately observe the restart and new unhealthy state, and so callers of the query-frontend can continue sending traffic to them while they're starting.
This can result in users receiving an error like
frontend not running: New
in response to queries.Based on observed behaviour in production clusters at Grafana Labs, most frontends start in less than 2s, so setting the timeout to 2s would be a reasonable choice.
Which issue(s) this PR fixes or relates to
(none)
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]