Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query-frontend worker: don't treat cancel as an error #4648

Merged
merged 1 commit into from
Nov 23, 2023

Conversation

bboreham
Copy link
Contributor

@bboreham bboreham commented Apr 1, 2023

What this PR does

When a query returns an error value, check whether that error is actually a cancellation before acting.
I expect two effects: the worker won't log "context canceled", and it will process the next request immediately.

Could be the user closed the tab, or it's a sharded query and some other shard errored. Whatever the cause, cancellation is not an error.

Checklist

  • NA Tests updated
  • NA Documentation added
  • CHANGELOG.md updated

@@ -14,6 +14,7 @@ import (
"github.com/go-kit/log"
"github.com/go-kit/log/level"
"github.com/grafana/dskit/backoff"
grpcUtils "github.com/weaveworks/common/grpc"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Package names, and aliases, should be all lowercase, so grpcutil.

Also an opinionated nit, but in this case I'd call it commongrpc.

@colega
Copy link
Contributor

colega commented Apr 3, 2023

Can we write a test for this? These issues keep biting us again and again.

@bboreham
Copy link
Contributor Author

bboreham commented Apr 3, 2023

Can we write a test for this? These issues keep biting us again and again.

I found it hard to think of a generic test to catch such issues by behaviour, but we could check all the logs from integration tests for the string "context canceled".

Copy link
Contributor

@colega colega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a test would be great, but I think we should prioritize delivering this. If we merge without a test, we could create a good-first-issue labeled issue to write a test.

@bboreham
Copy link
Contributor Author

bboreham commented Apr 4, 2023

we could check all the logs from integration tests for the string "context canceled".

There are lots of reports when querier shuts down, e.g. here:

17:48:01 Stopping querier
17:48:01 querier: level=error ts=2023-04-03T17:48:01.927794752Z caller=frontend_processor.go:88 msg="error processing requests" address=172.26.0.6:9095 err="rpc error: code = Canceled desc = context canceled"
17:48:01 querier: level=error ts=2023-04-03T17:48:01.929005407Z caller=client.go:242 msg="error getting path" key=collectors/store-gateway err="Get \"http://e2e-mimir-test-consul:8500/v1/kv/collectors/store-gateway?index=24&stale=&wait=10000ms\": context canceled"
17:48:01 querier: level=error ts=2023-04-03T17:48:01.929028008Z caller=client.go:242 msg="error getting path" key=collectors/ring err="Get \"http://e2e-mimir-test-consul:8500/v1/kv/collectors/ring?index=18&stale=&wait=10000ms\": context canceled"

In TestQueryFrontendWithQueryShardingAndTooManyRequests we see dozens like this:

17:53:45 query-frontend: ts=2023-04-03T17:53:45.385880639Z caller=retry.go:78 level=error user=e2e-user traceID=3f3c6a4b164a9feb msg="error processing request" try=4 err="context canceled"
17:53:45 query-frontend: ts=2023-04-03T17:53:45.385983544Z caller=retry.go:78 level=error user=e2e-user traceID=3f3c6a4b164a9feb msg="error processing request" try=4 err="context canceled"
17:53:45 query-frontend: ts=2023-04-03T17:53:45.386079348Z caller=retry.go:78 level=error user=e2e-user traceID=3f3c6a4b164a9feb msg="error processing request" try=4 err="context canceled"

@lamida
Copy link
Contributor

lamida commented Apr 17, 2023

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase main and eventually move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

Could be the user closed the tab, or it's a sharded query and some other
shard errored. Whatever the cause, cancellation is not an error.

I expect two effects: the worker won't log "context canceled", and it
will process the next request immediately.
@bboreham bboreham merged commit 73f1fb6 into main Nov 23, 2023
28 checks passed
@bboreham bboreham deleted the dont-log-canceled branch November 23, 2023 10:14
Copy link
Contributor

@charleskorn charleskorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this post-merge and noticed that we likely need to make the same change for the case where the query-scheduler is in use here.

(frontendProcessor is only used by queriers when the query-scheduler is not in use.)

@bboreham
Copy link
Contributor Author

Thanks for the pointer - filed #6767

bboreham added a commit that referenced this pull request Nov 28, 2023
Mistake in #4648.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
bboreham added a commit that referenced this pull request Nov 30, 2023
Mistake in #4648.

Signed-off-by: Bryan Boreham <bjboreham@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants