Query ingesters in PENDING state #6726

charleskorn · 2023-11-23T23:52:22Z

What this PR does

This PR fixes an issue introduced in #5342.

We shouldn't consider PENDING ingesters as unhealthy: in the case where new ingesters have just started due to a scale-up, we'll consider each zone with a PENDING ingester as unhealthy, and so if two zones have just scaled up and have PENDING ingesters, the query will fail.

Furthermore, we should attempt to query ingesters in the PENDING state: in the case where an ingester has just started, queriers may have only observed the ingester in the PENDING state, but distributors may have observed the ingester in the ACTIVE state and started sending samples, for example:

New ingester in zone A starts, publishes PENDING state to ring
Querier observes new ingester in PENDING state
New ingester sets state to ACTIVE
Distributor observes new ingester as ACTIVE
Distributor writes sample to ingester in zones A, B and C, write to B fails due to transient network issue
Querier initiates query, selects zones A and B for initial requests

If we don't query the new zone A ingester that the querier still thinks is PENDING, then we'll miss the sample that is only in zone A and C.

Note that the downside of this change is that queriers will attempt to query ingesters that are possibly still starting up, and may not be ready to receive requests. In this case, hedging will cause the querier to try other ingesters, if available, after at most 2s, or sooner if the connection to the ingester fails before hedging is triggered, and one of the following will happen:

If ingesters in other zones are healthy, then the impact is simply increased latency. An improvement for this would be to prioritize querying zones where all ingesters are ACTIVE, but I will add this in a follow up PR (Prefer querying ingester zones with the least number of non-ACTIVE ingesters #6727)
If ingesters in other zones are not healthy (eg. they're also still starting up), then the query will fail. However, this will be mitigated by the query-frontend retrying the query, which may succeed on a later attempt. This is unavoidable if we want to guarantee correct query results.

Which issue(s) this PR fixes or relates to

Related: #5342

Checklist

[n/a] Tests updated.
[n/a] Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
[n/a] about-versioning.md updated with experimental features.

pkg/distributor/query.go

Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>

Query ingesters in PENDING state

159590a

charleskorn marked this pull request as ready for review November 24, 2023 00:13

charleskorn requested a review from a team as a code owner November 24, 2023 00:13

This was referenced Nov 24, 2023

DoUntilQuorum: add the ability to prioritise zones when request minimisation is enabled grafana/dskit#440

Merged

Prefer querying ingester zones with the least number of non-ACTIVE ingesters #6727

Merged

Add changelog entry.

78a56d9

charleskorn force-pushed the charleskorn/query-pending-ingesters branch from 0d4a8f7 to 78a56d9 Compare November 24, 2023 03:15

pstibrany approved these changes Nov 24, 2023

View reviewed changes

colega reviewed Nov 24, 2023

View reviewed changes

pkg/distributor/query.go Outdated Show resolved Hide resolved

Address PR feedback: elaborate on comment

95c92c7

Co-authored-by: Oleg Zaytsev <mail@olegzaytsev.com>

charleskorn enabled auto-merge (squash) November 26, 2023 21:51

charleskorn merged commit 4a27a99 into main Nov 26, 2023
28 checks passed

charleskorn deleted the charleskorn/query-pending-ingesters branch November 26, 2023 22:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query ingesters in PENDING state #6726

Query ingesters in PENDING state #6726

charleskorn commented Nov 23, 2023 •

edited

Loading

Query ingesters in PENDING state #6726

Query ingesters in PENDING state #6726

Conversation

charleskorn commented Nov 23, 2023 • edited Loading

What this PR does

Which issue(s) this PR fixes or relates to

Checklist

charleskorn commented Nov 23, 2023 •

edited

Loading