Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query ingesters in PENDING state #6726

Merged
merged 3 commits into from
Nov 26, 2023
Merged

Conversation

charleskorn
Copy link
Contributor

@charleskorn charleskorn commented Nov 23, 2023

What this PR does

This PR fixes an issue introduced in #5342.

We shouldn't consider PENDING ingesters as unhealthy: in the case where new ingesters have just started due to a scale-up, we'll consider each zone with a PENDING ingester as unhealthy, and so if two zones have just scaled up and have PENDING ingesters, the query will fail.

Furthermore, we should attempt to query ingesters in the PENDING state: in the case where an ingester has just started, queriers may have only observed the ingester in the PENDING state, but distributors may have observed the ingester in the ACTIVE state and started sending samples, for example:

  • New ingester in zone A starts, publishes PENDING state to ring
  • Querier observes new ingester in PENDING state
  • New ingester sets state to ACTIVE
  • Distributor observes new ingester as ACTIVE
  • Distributor writes sample to ingester in zones A, B and C, write to B fails due to transient network issue
  • Querier initiates query, selects zones A and B for initial requests

If we don't query the new zone A ingester that the querier still thinks is PENDING, then we'll miss the sample that is only in zone A and C.

Note that the downside of this change is that queriers will attempt to query ingesters that are possibly still starting up, and may not be ready to receive requests. In this case, hedging will cause the querier to try other ingesters, if available, after at most 2s, or sooner if the connection to the ingester fails before hedging is triggered, and one of the following will happen:

  • If ingesters in other zones are healthy, then the impact is simply increased latency. An improvement for this would be to prioritize querying zones where all ingesters are ACTIVE, but I will add this in a follow up PR (Prefer querying ingester zones with the least number of non-ACTIVE ingesters #6727)
  • If ingesters in other zones are not healthy (eg. they're also still starting up), then the query will fail. However, this will be mitigated by the query-frontend retrying the query, which may succeed on a later attempt. This is unavoidable if we want to guarantee correct query results.

Which issue(s) this PR fixes or relates to

Related: #5342

Checklist

  • [n/a] Tests updated.
  • [n/a] Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • [n/a] about-versioning.md updated with experimental features.

@charleskorn charleskorn force-pushed the charleskorn/query-pending-ingesters branch from 0d4a8f7 to 78a56d9 Compare November 24, 2023 03:15
pkg/distributor/query.go Outdated Show resolved Hide resolved
Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>
@charleskorn charleskorn enabled auto-merge (squash) November 26, 2023 21:51
@charleskorn charleskorn merged commit 4a27a99 into main Nov 26, 2023
28 checks passed
@charleskorn charleskorn deleted the charleskorn/query-pending-ingesters branch November 26, 2023 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants