Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ingester: don't log errors that cause OOMs, using interface #5581

Merged
merged 2 commits into from
Jul 31, 2023

Conversation

bboreham
Copy link
Contributor

@bboreham bboreham commented Jul 25, 2023

What this PR does

Not-ready and over-max-inflight can be emitted when the server is overloaded, in which case we don't want to spend more resource logging the fact.

The max-inflight errors will be recorded in metrics added in #5551.

Temporarily using a branch from upstream weaveworks/common, to check Mimir still passes CI.
Update on #5494

Checklist

  • NA Tests updated
  • NA Documentation added
  • NA CHANGELOG.md updated

@bboreham bboreham requested a review from a team as a code owner July 25, 2023 15:00
Copy link
Contributor

@56quarters 56quarters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like this approach compared to other one (adding DoNotLogError in common).

@@ -17,6 +17,11 @@ const (
errorKey = "err"
)

// If an error implements Observe(), it will get called and GRPCServerLog will do nothing.
type Observer interface {
Copy link
Contributor

@56quarters 56quarters Jul 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document (or codify) some of the expectations of implementers of this? For example, we expect if you implement Observer you also implement error and have a .Unwrap() method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we expect if you implement Observer you also implement error and have a .Unwrap() method

I don't think we expect that, right? You'll have to implement error because this is checked on an error, but we don't need that. And there's no need for the Unwrap() method.

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looking at #5584, I’m not a big fan of Observe(). We’re going to have the logging logic scattered around. I’m wondering if we could simply achieve the same with a function which tells whether the error should be logged or not (return bool), instead of delegating the log to the error's Observe().

@bboreham
Copy link
Contributor Author

Also looking at #5584, I’m not a big fan of Observe().

How about ShouldLog() ? weaveworks/common@9927ce6

@@ -45,6 +44,7 @@ import (
"github.com/grafana/mimir/pkg/mimirpb"
"github.com/grafana/mimir/pkg/util"
"github.com/grafana/mimir/pkg/util/globalerror"
log_util "github.com/grafana/mimir/pkg/util/log"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

util_log for consistency.

@pracucci
Copy link
Collaborator

Also looking at #5584, I’m not a big fan of Observe().

How about ShouldLog() ? weaveworks/common@9927ce6

Looks better to me. Do we need to pass duration time.Duration? What will be used for? I can't see it used in #5584.

@bboreham
Copy link
Contributor Author

Do we need to pass duration time.Duration? What will be used for?

I took the five parameters from the previous version, and tried to imagine which ones could be used.
"I want to log all operations over 5 seconds" seemed possible, so I left the duration in.
Getting something from the Context, perhaps a tracing span, also seemed likely.

I know it's YAGNI, but we can never change that interface (that would silently break all users).
We can add more interfaces.

56quarters added a commit that referenced this pull request Jul 27, 2023
This allows us to decorate them with extra information for gRPC responses
and our logging middleware (to prevent them from being logged, increasing
resource usage).

Related #5581
Related weaveworks/common#299

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
@@ -91,9 +83,6 @@ func (l Log) Wrap(next http.Handler) http.Handler {
statusCode, writeErr := wrapped.getStatusCode(), wrapped.getWriteError()

if writeErr != nil {
if errors.Is(writeErr, DoNotLogError{}) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't required anymore? Can you explain why was it removed from here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is the return from Write(); I don't see how it can be one of our wrapped errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I looked at the high-volume log lines that have been problemmatic and they were all from grpc_logging, none from http request logging.

@pracucci
Copy link
Collaborator

I pushed a tiny change to a test, to improve it. Hope you don't mind: 0108bb6

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (modulo a question)

@colega
Copy link
Contributor

colega commented Jul 28, 2023

ShouldLog also works for me, although I don't like the fact that the error won't say the sampling. When you're reading the logs with no context, you might easily miss the fact that the error is sampled if the error doesn't say that explicitly.

@pracucci
Copy link
Collaborator

ShouldLog also works for me, although I don't like the fact that the error won't say the sampling. When you're reading the logs with no context, you might easily miss the fact that the error is sampled if the error doesn't say that explicitly.

I agree (also mentioned in the related design doc). I was waiting for the next PR that will introduce the sampling, to stress about this ;)

@bboreham
Copy link
Contributor Author

Upstream PR in weaveworks/common has merged, so I rebased. Please check I didn't drop any of your commits.

the next PR that will introduce the sampling

PR #5584 is sampling 6 or so errors.

Comment on lines -278 to -273
// Use a fork of weaveworks/common while we work out if there is a better design for https://github.com/weaveworks/common/pull/293
replace github.com/weaveworks/common => github.com/weaveworks/common v0.0.0-20230714173453-d1f8877b91ce
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: removed because weaveworks/common#299 has been merged.

bboreham and others added 2 commits July 31, 2023 08:52
Using weaveworks/common#299 which is a more
flexible approach.

Note the check disappeared from `logging.go`, because it was a mistake
to check that error.  It comes from `io.Writer`, it won't be an app-
level error.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@bboreham bboreham merged commit 1e5298e into main Jul 31, 2023
28 checks passed
@bboreham bboreham deleted the no-log-error2 branch July 31, 2023 13:09
56quarters added a commit that referenced this pull request Jul 31, 2023
This allows us to decorate them with extra information for gRPC responses
and our logging middleware (to prevent them from being logged, increasing
resource usage).

Related #5581
Related weaveworks/common#299

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this pull request Aug 3, 2023
This allows us to decorate them with extra information for gRPC responses
and our logging middleware (to prevent them from being logged, increasing
resource usage).

Related #5581
Related weaveworks/common#299

Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
56quarters added a commit that referenced this pull request Aug 3, 2023
…gging for them (#5585)

This allows us to decorate them with extra information for gRPC responses
and our logging middleware (to prevent them from being logged which is
expensive).

Related #5581
Related weaveworks/common#299
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants