store-gateway: add timeout to query gate wait #7777

Merged

Conversation

dimitarvdimitrov
Contributor

What this PR does

This PR adds a timeout to waiting on the query gate. When the wait times out, the store-gateway returns an empty response. The querier treats the empty response as if the blocks weren't discovered by the store-gateway and retries them on another store-gateway.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

CHANGELOG.md (review comments outdated, resolved)
pkg/storage/tsdb/config.go (review comments outdated, resolved)
@@ -611,6 +611,11 @@ func (s *BucketStore) Series(req *storepb.SeriesRequest, srv storegatewaypb.Stor
// but sometimes it can take minutes if the block isn't loaded and there is a surge in queries for unloaded blocks.
done, err := s.limitConcurrentQueries(ctx, stats)
if err != nil {
if errors.Is(err, errGateTimeout) {
Contributor

If a timeout occurs here, should we add an event to the trace span or log something?

It might also be useful to add a metric for the number of queries that were rejected due to the gate timing out - a high rate of this indicates either a performance problem or suggests scaling out store-gateways would be useful.
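
The author notes below that a span log was added for this. As a rough, hedged sketch of what "add an event to the trace span or log something" could look like, assuming dskit's spanlogger.FromContext helper (the same spanlogger that produces the spanLog.DebugLog call quoted further down); the function, message, and field names here are made up for illustration and are not necessarily what the PR added:

```go
package gatesketch

import (
	"context"
	"time"

	"github.com/go-kit/log"
	"github.com/grafana/dskit/spanlogger"
)

// logGateTimeout records the gate-wait timeout on the current request's span
// (and at debug level in the logs) without failing the request.
// Illustrative sketch only; names are hypothetical.
func logGateTimeout(ctx context.Context, logger log.Logger, waited time.Duration) {
	spanLog := spanlogger.FromContext(ctx, logger)
	spanLog.DebugLog("msg", "timed out waiting to enter the query gate", "waited", waited)
}
```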

Collaborator

> It might also be useful to add a metric for the number of queries that were rejected due to the gate timing out - a high rate of this indicates either a performance problem or suggests scaling out store-gateways would be useful.

A metric has been added directly in dskit:
grafana/dskit#512

It doesn't tell you exactly why the gate didn't allow the query, though.

Contributor Author

> If a timeout occurs here, should we add an event to the trace span or log something?

Good idea. I will add a span log.

> It doesn't tell you exactly why the gate didn't allow the query, though.

With the current setup a query can be "not_permitted" either because it was cancelled (context.Canceled) or because it timed out (context.DeadlineExceeded). I can add this distinction to the metrics. There are also the response metrics of the gRPC method calls, which would record status_code="Canceled" or status_code="OK", but it might be clearer to have this separation in the dskit metrics.
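
As a concrete, hedged sketch of that split (hypothetical metric and label names; the real counter added in grafana/dskit#512 may differ), the classification could look roughly like this:

```go
package gatesketch

import (
	"context"
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

// gateNotPermitted is a hypothetical counter; the metric discussed above was
// added directly in dskit (grafana/dskit#512) and may be named differently.
var gateNotPermitted = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "example_query_gate_not_permitted_total",
	Help: "Queries not admitted through the query gate, by reason.",
}, []string{"reason"})

// recordGateFailure splits "not_permitted" into the two cases described above:
// the caller went away (context.Canceled) versus the wait timed out
// (context.DeadlineExceeded).
func recordGateFailure(err error) {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		gateNotPermitted.WithLabelValues("timeout").Inc()
	case errors.Is(err, context.Canceled):
		gateNotPermitted.WithLabelValues("canceled").Inc()
	default:
		gateNotPermitted.WithLabelValues("other").Inc()
	}
}
```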

@@ -611,6 +611,11 @@ func (s *BucketStore) Series(req *storepb.SeriesRequest, srv storegatewaypb.Stor
// but sometimes it can take minutes if the block isn't loaded and there is a surge in queries for unloaded blocks.
done, err := s.limitConcurrentQueries(ctx, stats)
if err != nil {
if errors.Is(err, errGateTimeout) {
// If the gate timed out, then we behave in the same way as if the blocks aren't discovered yet.
// The querier will try the blocks again on a different store-gateway replica.
Contributor

How is this situation logged by queriers? (I'm wondering if we'll get a misleading or confusing message logged by queriers along the lines of "store-gateway hasn't discovered blocks", when the truth is that the store-gateway timed out the request.)

Collaborator

I have a similar concern here. Why not return the error? The querier will retry the request on other store-gateways in case of an error, unless it's detected as a non-retriable error (provided we make sure the error we return here is detected as retryable).

Contributor Author

> How is this situation logged by queriers?

This is the span log that the querier writes:

spanLog.DebugLog("msg", "couldn't query all blocks", "attempt", attempt, "missing blocks", strings.Join(convertULIDsToString(remainingBlocks), " "))

If none of the store-gateways can query a block, then this is the error that the querier generates:

err = newStoreConsistencyCheckFailedError(remainingBlocks)
level.Warn(util_log.WithContext(ctx, spanLog)).Log("msg", "failed consistency check after all attempts", "err", err)

> Why not return the error? The querier will retry the request on other store-gateways in case of an error, unless it's detected as a non-retriable error (provided we make sure the error we return here is detected as retryable).

That's not the case unfortunately. If the store-gateway returns an error, then the querier will immediately abort the query:

// Fetch series from stores. If an error occur we do not retry because retries
// are only meant to cover missing blocks.
queriedBlocks, err := queryF(clients, minT, maxT)
if err != nil {
	return err
}

I did consider returning a special error, but that seemed more complicated because of response processing in fetchSeriesFromStores. If you still think it's a better idea to have a special error and handle it properly, then I can do the necessary refactors; perhaps even adopt a similar error-handling pattern to the distributor-ingester ingestion error handling.

Collaborator

> That's not the case unfortunately. If the store-gateway returns an error, then the querier will immediately abort the query:

I'm not sure about that. I think the "magic" is done by queryF. For example, look at the fetchSeriesFromStores() implementation and its call to shouldStopQueryFunc(). I think the idea is that queryF only returns fatal errors, while swallowing errors that shouldn't stop the query execution.

Contributor Author

You're right, there is shouldStopQueryFunc:

if shouldStopQueryFunc(err) {
	return err
}
level.Warn(log).Log("msg", "failed to fetch series", "remote", c.RemoteAddress(), "err", err)
return nil

Let me try adding some error handling similar to the distributor-ingester errors.
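
For readers unfamiliar with that pattern: the server attaches structured details to a gRPC status, and the client inspects them to decide whether to retry. A minimal, hedged sketch of the inspection side, reusing only the mimirpb.ErrorDetails shape that the tests added later in this PR construct; the real helpers in Mimir may look different, and the stop/retry policy itself is deliberately not reproduced here:

```go
package querysketch

import (
	"github.com/grafana/mimir/pkg/mimirpb"
	"google.golang.org/grpc/status"
)

// mimirErrorDetails extracts the structured Mimir error details, if any, that
// were attached to a gRPC status error. The caller can then decide from the
// Cause whether to stop the query or retry on another store-gateway replica.
// Sketch only; not the actual Mimir implementation.
func mimirErrorDetails(err error) (*mimirpb.ErrorDetails, bool) {
	st, ok := status.FromError(err)
	if !ok {
		return nil, false
	}
	for _, detail := range st.Details() {
		if details, ok := detail.(*mimirpb.ErrorDetails); ok {
			return details, true
		}
	}
	return nil, false
}
```

For example, a caller could keep retrying when the cause is mimirpb.INSTANCE_LIMIT, which matches the "should not stop query on store-gateway instance limit" cases in the test table further down.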

Contributor Author

This ended up being more complicated than I expected, too. I opened a PR to prepare for this change: #7790. I could have copied some code instead, but I wanted to move towards reducing code duplication.

pkg/querier/blocks_store_queryable_test.go (review comments outdated, resolved)
@pracucci pracucci self-requested a review April 3, 2024 06:53
@pracucci pracucci left a comment (Collaborator)

Overall LGTM, thanks. I'm not convinced by not returning an error from the store-gateway: to me, it looks more natural to return an error.

@@ -420,6 +421,27 @@ func (u *BucketStores) syncDirForUser(userID string) string {
return filepath.Join(u.cfg.BucketStore.SyncDir, userID)
}

type timeoutGate struct {
Collaborator

[nit] Looks like something we could have done in dskit. Non-blocking.

Contributor Author

I also realized this after submitting the PR, but dskit doesn't have context.WithTimeoutCause, and implementing that isn't as trivial as I expected. Let's keep it here since the scope is very small anyway. Someone can move it to dskit when it's updated. I left a comment on timeoutGate: 7bf6b24
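
For readers who haven't opened the diff, here is a minimal, self-contained sketch of the idea behind timeoutGate, assuming a simplified gate interface in the spirit of dskit's gate.Gate; the PR's actual implementation (and the comment added in 7bf6b24) may differ in details:

```go
package gatesketch

import (
	"context"
	"errors"
	"time"
)

// errGateTimeout is the sentinel used in this PR's diff; everything else in
// this snippet is a sketch rather than the PR's actual implementation.
var errGateTimeout = errors.New("timed out waiting on the query gate")

// gate is a simplified stand-in for dskit's gate.Gate: Start blocks until a
// concurrency slot is free or the context is done, Done releases the slot.
type gate interface {
	Start(ctx context.Context) error
	Done()
}

// timeoutGate gives up waiting on the wrapped gate after maxWait.
// context.WithTimeoutCause (Go 1.21+) attaches errGateTimeout as the
// cancellation cause, so a timed-out wait can be told apart from the
// request's own context being cancelled.
type timeoutGate struct {
	delegate gate
	maxWait  time.Duration
}

func (t timeoutGate) Start(ctx context.Context) error {
	waitCtx, cancel := context.WithTimeoutCause(ctx, t.maxWait, errGateTimeout)
	defer cancel()

	err := t.delegate.Start(waitCtx)
	if err != nil && errors.Is(context.Cause(waitCtx), errGateTimeout) {
		// Map the context error back to the sentinel so callers can use
		// errors.Is(err, errGateTimeout), as the Series handler quoted above does.
		return errGateTimeout
	}
	return err
}

func (t timeoutGate) Done() { t.delegate.Done() }
```

Mapping the context cause back to errGateTimeout is what lets the errors.Is(err, errGateTimeout) check in the Series handler distinguish a gate-wait timeout from the caller cancelling the request.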

@jdbaldry jdbaldry left a comment (Member)

configuration documentation lgtm

@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/st-gw/add-timeout-to-query-gate-waiting branch from 2ae4e2b to 13fad9e Compare May 3, 2024 08:26
dimitarvdimitrov and others added 9 commits May 3, 2024 10:26
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/st-gw/add-timeout-to-query-gate-waiting branch from 13fad9e to ed7012c Compare May 3, 2024 08:27
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/st-gw/add-timeout-to-query-gate-waiting branch from 816718d to df43e5f Compare May 3, 2024 11:11
Comment on lines +3291 to +3318
"should not stop query on store-gateway instance limit": {
err: globalerror.NewErrorWithGRPCStatus(errors.New("instance limit"), codes.Aborted, &mimirpb.ErrorDetails{Cause: mimirpb.INSTANCE_LIMIT}),
expected: false,
},
"should not stop query on store-gateway instance limit; shouldn't look at the gRPC code, only Mimir error cause": {
err: globalerror.NewErrorWithGRPCStatus(errors.New("instance limit"), codes.Internal, &mimirpb.ErrorDetails{Cause: mimirpb.INSTANCE_LIMIT}),
expected: false,
},
"should not stop query on any other mimirpb error": {
err: globalerror.NewErrorWithGRPCStatus(errors.New("instance limit"), codes.Internal, &mimirpb.ErrorDetails{Cause: mimirpb.TOO_BUSY}),
expected: false,
},
"should not stop query on any unknown error detail": {
err: func() error {
st, createErr := status.New(codes.Internal, "test").WithDetails(&hintspb.Block{Id: "123"})
require.NoError(t, createErr)
return st.Err()
}(),
expected: false,
},
"should not stop query on multiple error details": {
err: func() error {
st, createErr := status.New(codes.Internal, "test").WithDetails(&hintspb.Block{Id: "123"}, &mimirpb.ErrorDetails{Cause: mimirpb.INSTANCE_LIMIT})
require.NoError(t, createErr)
return st.Err()
}(),
expected: false,
},
Contributor Author

I didn't change the implementation, because that was the default behaviour anyway, but I decided to add tests because the current behaviour is implicit rather than explicit.

@pracucci pracucci left a comment (Collaborator)

Very nice work, LGTM!

@dimitarvdimitrov dimitarvdimitrov merged commit 61a1c8b into main May 3, 2024
29 checks passed