
Create base error type for ingester per-instance errors and remove logging for them #5585

Merged
merged 3 commits into main from 56quarters/instance-errors on Aug 3, 2023

Conversation

Contributor
@56quarters commented Jul 25, 2023

What this PR does

Create a base type for ingester per-instance limit errors. This allows us to decorate them with extra information for gRPC responses and for our logging middleware, so that they are not logged (logging them would increase resource usage).
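
Putting the pieces from the review below together, a minimal sketch of the base type being described might look like this (field and function names are taken from the snippets in this PR; the GRPCStatus and Error methods are assumptions based on the discussion, not verbatim from the change):

package ingester

import (
    "context"
    "time"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// instanceLimitErr decorates a per-instance limit error with a gRPC status and
// an opt-out from the logging middleware.
type instanceLimitErr struct {
    status *status.Status
}

func newInstanceLimitError(err error) error {
    return &instanceLimitErr{
        // Errors from hitting per-instance limits are always "unavailable" for gRPC.
        status: status.New(codes.Unavailable, err.Error()),
    }
}

// GRPCStatus lets gRPC's status.FromError pick up the code and message.
func (e *instanceLimitErr) GRPCStatus() *status.Status { return e.status }

// ShouldLog tells the logging middleware to skip this error; metrics already
// track per-instance limit violations.
func (e *instanceLimitErr) ShouldLog(context.Context, time.Duration) bool { return false }

func (e *instanceLimitErr) Error() string { return e.status.Message() }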

Which issue(s) this PR fixes or relates to

Related #5581
Related #5551
Related weaveworks/common#299

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@56quarters force-pushed the 56quarters/instance-errors branch 3 times, most recently from 93b4276 to 2295ff5 on July 31, 2023 14:11
@56quarters marked this pull request as ready for review July 31, 2023 14:31
@56quarters requested a review from a team as a code owner July 31, 2023 14:31
func newInstanceLimitError(err error) error {
    return &instanceLimitErr{
        // Errors from hitting per-instance limits are always "unavailable" for gRPC
        status: status.New(codes.Unavailable, err.Error()),
    }
}
Contributor Author

The motivation for this is being able to use Unavailable for errors that count towards a circuit breaker or error throttling mechanism (discussed internally)
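
As an illustration of that motivation (not code from this PR; the function name below is made up), a caller-side circuit breaker or throttling layer could key off the Unavailable code like so:

package example

import (
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// countsTowardsBreaker reports whether an error should be fed into a circuit
// breaker or error-throttling mechanism, based on its gRPC code being Unavailable.
func countsTowardsBreaker(err error) bool {
    st, ok := status.FromError(err)
    return ok && st.Code() == codes.Unavailable
}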

This allows us to decorate them with extra information for gRPC responses
and our logging middleware (to prevent them from being logged, increasing
resource usage).

Related #5581
Related weaveworks/common#299

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Comment on lines +48 to +52
func (e *instanceLimitErr) ShouldLog(context.Context, time.Duration) bool {
    // We increment metrics when hitting per-instance limits and so there's no need to
    // log them, the error doesn't contain any interesting information for us.
    return false
}
Contributor (@bboreham)

If I read this correctly, then this will remove all logging, and as such should be prominently noted in the PR description.

I have an alternative proposal in #5584, to log 1 in N. I think there is useful information in the specific labels that cause problems.

Contributor Author

Correct, it will remove all logging for these per-instance errors. We already don't log anything for in-flight requests and all of these increment metrics.

Contributor Author

I have an alternative proposal in #5584, to log 1 in N. I think there is useful information in the specific labels that cause problems.

I don't understand; there are no labels here. This is only for per-instance (ingester) limits, so there's no extra information to log beyond "this thing happened" -- unlike per-tenant limits, which it seems #5584 is addressing (and I agree, sampling is useful for that case).

Collaborator

@bboreham Maybe there's a misunderstanding. This PR is about per-instance limits, which don't look to be rate limited by PR #5584 (as far as I can see).
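
For context, the middleware side of this contract (the weaveworks/common change referenced above) amounts to an optional-interface check roughly like the following; this is a hedged sketch, not the middleware's actual code, and the interface name here is illustrative:

package example

import (
    "context"
    "time"
)

// optionalLogging is the shape of the interface the logging middleware looks
// for; the real interface lives in weaveworks/common, this name is made up.
type optionalLogging interface {
    ShouldLog(ctx context.Context, duration time.Duration) bool
}

// shouldLogError returns false only when the error explicitly opts out of
// logging, as the ingester's per-instance limit errors do above.
func shouldLogError(ctx context.Context, err error, duration time.Duration) bool {
    if ol, ok := err.(optionalLogging); ok {
        return ol.ShouldLog(ctx, duration)
    }
    return true
}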

@56quarters changed the title from "Create base error type for ingester per-instance errors" to "Create base error type for ingester per-instance errors and remove logging for them" on Aug 3, 2023
pkg/ingester/ingester.go (resolved review comments not shown)
@@ -760,6 +760,11 @@ func (i *Ingester) PushWithCleanup(ctx context.Context, pushReq *push.Request) (

    db, err := i.getOrCreateTSDB(userID, false)
    if err != nil {
        // Check for a particular per-instance limit and return that error directly
        // since it contains extra information for gRPC and our logging middleware.
        if errors.Is(err, errMaxTenantsReached) {
Collaborator

The logging middleware can handle wrapped errors, but unfortunately gRPC's FromError() does not. I'm wondering if what we should do here, to make the code more generic, is check whether the error implements GRPCStatus() and, if so, not wrap it.

Collaborator

I'm wrong here as well. FromError() can unwrap too.

Collaborator

So why do we need this logic at all?

Collaborator

Could you improve the existing tests to assert on the returned error and check that gRPC status.FromError() returns the expected code?

Contributor Author

So why do we need this logic at all?

Because wrapWithUser() as it exists today doesn't actually "wrap" the error. It does something like errors.New(err.Error()) and so it returns a brand new error that doesn't have any relation to the original error or implement the GRPCStatus() or ShouldLog() methods. I could change it but it seemed like it purposefully didn't wrap the existing error.

Could you improve the existing tests to assert on the returned error and check that gRPC status.FromError() returns the expected code?

Sure, will do.
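
A sketch of the kind of assertion being asked for (the push setup and limit configuration are omitted and assumed; the tests actually added in this PR may differ):

package ingester

import (
    "testing"

    "github.com/stretchr/testify/require"
    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// assertUnavailable checks that an error returned when a per-instance limit is
// hit carries the gRPC Unavailable code via status.FromError.
func assertUnavailable(t *testing.T, err error) {
    t.Helper()

    stat, ok := status.FromError(err)
    require.True(t, ok, "expected a gRPC status error")
    require.Equal(t, codes.Unavailable, stat.Code())
}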

Collaborator

Because wrapWithUser() as it exists today doesn't actually "wrap" the error.

Ouf... you're right, again.

I could change it but it seemed like it purposefully didn't wrap the existing error.

No, don't do it. I think it would be unsafe. There's some context in cortexproject/cortex#2004, but the TL;DR is that some errors carry the series labels, and those labels are only safe to read during the execution of push() because they're unmarshalled from the write request into a pool. So before returning, we effectively "make a copy" to ensure the error is safe to use after push() has returned.
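
A tiny sketch of the behavior being discussed (wrapWithUserSketch below is a stand-in with a made-up format string and message text, not the actual Mimir wrapWithUser): building a new error from the original's message copies the string, so the result no longer matches the sentinel with errors.Is, no longer implements GRPCStatus() or ShouldLog(), and no longer references any pooled request memory the original message might point at.

package example

import (
    "errors"
    "fmt"
)

// The sentinel name comes from the PR; the message text here is illustrative.
var errMaxTenantsReached = errors.New("max tenants per ingester reached")

// wrapWithUserSketch mimics the behavior described above: it formats a brand
// new error from the original's message rather than wrapping it, detaching the
// result from the original error value and from any pooled request memory.
func wrapWithUserSketch(err error, userID string) error {
    return fmt.Errorf("user %s: %s", userID, err.Error())
}

func demo() {
    copied := wrapWithUserSketch(errMaxTenantsReached, "tenant-1")
    // Prints false: the copy has no relation to the sentinel, which is why the
    // PR checks errors.Is for errMaxTenantsReached before the error is wrapped.
    fmt.Println(errors.Is(copied, errMaxTenantsReached))
}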

Collaborator

The errMaxTenantsReached and errMaxInMemorySeriesReached errors don't carry any labels, so they're expected to be safe.

In retrospect, we could have handled the returned error better, maybe by implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)

Contributor
@colega commented Sep 12, 2023


In retrospect, we could have handled the returned error better, maybe by implementing a specific interface for unsafe error messages, because it's currently very obscure. But that's for another day :)

IMO we should create an issue and tackle that; otherwise this is a recipe for disaster, and it's only a matter of time before it explodes somewhere.

Errors are meant to escape your function; we shouldn't return errors that aren't safe to escape beyond a certain point.

Edit: issue #6008

Collaborator
@pracucci left a comment


Approving because the logic is OK, but I'm not sure we actually need to avoid wrapping the per-instance limit errors.

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@56quarters merged commit fa415b0 into main Aug 3, 2023
28 checks passed
@56quarters deleted the 56quarters/instance-errors branch August 3, 2023 17:25
Contributor
@bboreham commented Aug 4, 2023

Sorry I got confused between per-instance and per-tenant.

4 participants