
Proposal: selective and fast logs deduplicator #1900

Open
pracucci opened this issue May 20, 2022 · 3 comments

Comments

@pracucci
Collaborator

Problem

A high-traffic Mimir cluster can log a lot. For example, I've analysed the log rate of a medium-size cluster (with a good percentage of requests returning 4xx because of limits being hit or out-of-order/out-of-bounds samples being written) running with -log.level=info, and the vast majority of logs come from 2 sources: grpc_logging.go and push.go.

All other logging callers are orders of magnitude less noisy.

Data has been queried from Loki:

sum by(caller) (rate({namespace="REDACTED"} | logfmt | __error__="" [5m]))

[Screenshot (2022-05-20): per-caller log rates from the Loki query above]

Logs are very important and useful when debugging, but repeating the same log hundreds or thousands of times per second is not very useful, other than adding pressure to the system.

Proposal

I propose to build a logs deduplicator in Mimir, following these design principles:

  • Selective: deduplicate only logs from grpc_logging.go and push.go (in the future it can be plugged into other places, if required).
  • Intelligent: deduplicate logs of the same "type" but not necessarily with the exact same log message (in our use cases, log messages are rarely exactly equal).
  • Fast: ideally, it should be a positive-sum change: the overhead introduced by the deduplicator should be absorbed by the reduced pressure on the downstream logging pipeline.
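To make the "selective" principle concrete, here's a minimal sketch assuming a go-kit-style Logger interface (the logging abstraction Mimir uses). All names here (withDedup, keyFunc, etc.) are hypothetical, not part of any actual implementation:

```go
package main

import "fmt"

// Logger mirrors the go-kit log.Logger interface. Only the loggers handed
// to the noisy call sites (grpc_logging.go, push.go) would be wrapped with
// the deduplicator; all other callers keep the plain logger.
type Logger interface {
	Log(keyvals ...interface{}) error
}

// keyFunc derives a dedup key from the key/value pairs; an empty key means
// the line is not a dedup candidate and is passed through unchanged.
type keyFunc func(keyvals ...interface{}) string

type dedupLogger struct {
	next Logger
	key  keyFunc
	seen map[string]int // dedup key -> occurrences since last flush
}

func withDedup(next Logger, key keyFunc) *dedupLogger {
	return &dedupLogger{next: next, key: key, seen: map[string]int{}}
}

func (d *dedupLogger) Log(keyvals ...interface{}) error {
	k := d.key(keyvals...)
	if k == "" {
		return d.next.Log(keyvals...) // not a dedup candidate: pass through
	}
	if d.seen[k]++; d.seen[k] > 1 {
		return nil // suppressed; a periodic flush would re-emit it with a count
	}
	return d.next.Log(keyvals...)
}

// countingLogger is a trivial Logger used only to demonstrate the wrapper.
type countingLogger struct{ lines int }

func (c *countingLogger) Log(keyvals ...interface{}) error { c.lines++; return nil }

func main() {
	c := &countingLogger{}
	l := withDedup(c, func(keyvals ...interface{}) string { return "same-key" })
	for i := 0; i < 100; i++ {
		l.Log("msg", "gRPC", "err", "out of order sample")
	}
	fmt.Println("lines actually logged:", c.lines) // prints: lines actually logged: 1
}
```

Because the wrapper is applied per call site, the cost of key extraction is paid only where the noise actually is.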

Intelligent

An example log:

level=warn ts=2022-05-19T14:38:39.174301048Z caller=grpc_logging.go:38 method=/cortex.Ingester/Push duration=1.174705ms err="rpc error: code = Code(400) desc = user=tenant-1: err: out of order sample. timestamp=2022-05-19T14:38:21.89Z, series={__name__=\"loki_ingester_chunk_size_bytes_sum\", pod=\"ingester-1\"}" msg=gRPC

The deduplication key for the example log above should be composed only of:

  • code=400
  • user=tenant-1
  • err: out of order sample
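As a purely illustrative sketch of how such a key could be derived (the regular expressions and function name are hypothetical, not Mimir's actual parsing):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extractDedupKey builds a deduplication key from a gRPC error string,
// keeping only the stable parts (status code, tenant, error kind) and
// dropping variable parts such as duration, timestamps and series labels.
var (
	codeRe = regexp.MustCompile(`code = Code\((\d+)\)`)
	userRe = regexp.MustCompile(`user=(\S+?):`)
	errRe  = regexp.MustCompile(`err: ([^.]+)`)
)

func extractDedupKey(errMsg string) string {
	var parts []string
	if m := codeRe.FindStringSubmatch(errMsg); m != nil {
		parts = append(parts, "code="+m[1])
	}
	if m := userRe.FindStringSubmatch(errMsg); m != nil {
		parts = append(parts, "user="+m[1])
	}
	if m := errRe.FindStringSubmatch(errMsg); m != nil {
		parts = append(parts, strings.TrimSpace(m[1]))
	}
	return strings.Join(parts, "|")
}

func main() {
	msg := `rpc error: code = Code(400) desc = user=tenant-1: err: out of order sample. timestamp=2022-05-19T14:38:21.89Z`
	fmt.Println(extractDedupKey(msg)) // prints: code=400|user=tenant-1|out of order sample
}
```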

Discussion on actual proposed implementation will follow.

@bboreham
Contributor

In the past we failed by not excluding the duration from dedupe.

@replay
Contributor

replay commented May 30, 2022

Would a de-duplicated log line have a field which indicates how many lines have been de-duplicated into one? I think that can still be important to know in some cases.

@pracucci
Collaborator Author

Would a de-duplicated log line have a field which indicates how many lines have been de-duplicated into one?

Definitely yes!
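One possible shape for this (a hypothetical sketch, not an actual implementation): the deduplicator keeps a counter per key, and a periodic flush re-emits one representative line per key with an added count field, so the volume information is preserved:

```go
package main

import (
	"fmt"
	"sync"
)

// dedupState counts suppressed lines per dedup key between flushes.
type dedupState struct {
	mu      sync.Mutex
	entries map[string]*entry
}

type entry struct {
	count int
	line  string // last line seen for this key
}

func newDedupState() *dedupState {
	return &dedupState{entries: map[string]*entry{}}
}

// Observe records one occurrence of a line under its dedup key.
func (d *dedupState) Observe(key, line string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	e, ok := d.entries[key]
	if !ok {
		d.entries[key] = &entry{count: 1, line: line}
		return
	}
	e.count++
	e.line = line
}

// Flush returns one line per key, annotated with the number of deduplicated
// occurrences, and resets the state. A real implementation would run this on
// a ticker.
func (d *dedupState) Flush() []string {
	d.mu.Lock()
	defer d.mu.Unlock()
	out := make([]string, 0, len(d.entries))
	for _, e := range d.entries {
		out = append(out, fmt.Sprintf("%s count=%d", e.line, e.count))
	}
	d.entries = map[string]*entry{}
	return out
}

func main() {
	d := newDedupState()
	for i := 0; i < 3; i++ {
		d.Observe("code=400|user=tenant-1|out of order sample", `level=warn msg=gRPC err="out of order sample"`)
	}
	for _, l := range d.Flush() {
		fmt.Println(l) // prints the line once, with count=3 appended
	}
}
```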
