
add metric to track out-of-space errors #8237

Merged
merged 5 commits into main from add-metrics-to-count-out-of-disk-errors on Jun 3, 2024

Conversation

@replay replay (Contributor) commented May 31, 2024

This adds a separate metric to count how many times a compaction failed due to an out of space error.

We need to be able to alert on out-of-space conditions with higher criticality than on generic compaction failures. Generic compaction failures can be caused by transient issues such as network lag, and we don't want to be alerted on every network blip. An out-of-space condition, by contrast, is usually not transient and typically requires an operator to resolve, so it makes sense to separate that metric from other compaction failures.

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
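
For illustration, here is a minimal Go sketch of how such a counter could be wired up. It is not the PR's actual code: the metrics struct, the existing failure counter name, and the trackCompactionFailure helper are assumptions for the sketch, while cortex_compactor_out_of_space_errors_total and the errors.Is(err, syscall.ENOSPC) check do appear in this PR.

// Minimal sketch (not the actual PR code): a dedicated counter for out-of-space
// errors next to the generic compaction failure counter.
package compactor

import (
	"errors"
	"syscall"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

type compactionMetrics struct {
	failed     prometheus.Counter
	outOfSpace prometheus.Counter
}

func newCompactionMetrics(reg prometheus.Registerer) *compactionMetrics {
	return &compactionMetrics{
		failed: promauto.With(reg).NewCounter(prometheus.CounterOpts{
			Name: "cortex_compactor_runs_failed_total", // pre-existing failure counter; exact name assumed here
			Help: "Total number of compaction runs that failed.",
		}),
		outOfSpace: promauto.With(reg).NewCounter(prometheus.CounterOpts{
			Name: "cortex_compactor_out_of_space_errors_total", // the metric added by this PR
			Help: "Total number of times the compactor failed because the disk was out of space.",
		}),
	}
}

// trackCompactionFailure increments the out-of-space counter in addition to the
// generic failure counter when the error wraps ENOSPC.
func (m *compactionMetrics) trackCompactionFailure(err error) {
	if err == nil {
		return
	}
	if errors.Is(err, syscall.ENOSPC) {
		m.outOfSpace.Inc()
	}
	m.failed.Inc()
}

Whether an out-of-space error should also count toward the generic failure counter is a design choice; the sketch counts it in both so that existing failure-based alerts keep firing.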
@replay replay marked this pull request as ready for review May 31, 2024 12:48
@replay replay requested a review from a team as a code owner May 31, 2024 12:48
@pracucci pracucci (Collaborator) left a comment


LGTM (assuming you've tested the error is actually captured when disk is out of space... which may not be super difficult to test in the local dev env using a very small volume for the compactor). Also remember to add a CHANGELOG entry.

@@ -665,6 +674,9 @@ func (c *MultitenantCompactor) compactUsers(ctx context.Context) {
// We don't want to count shutdowns as failed compactions because we will pick up with the rest of the compaction after the restart.
level.Info(c.logger).Log("msg", "compaction for user was interrupted by a shutdown", "user", userID)
return
case errors.Is(err, syscall.ENOSPC):
A Collaborator commented on this line:
Have you done any manual test to ensure the error is correctly captured, given it's something we can't easily / reliably assert with a unit test?


pkg/compactor/compactor.go (outdated review thread, resolved)
@replay replay (Contributor Author) commented Jun 3, 2024

I'll document my manual test here:

I ran a Mimir cluster in read-write mode, with the following configuration changed from the defaults to make it easier to reproduce the out-of-space condition:

  blocks_storage:
    tsdb:
      block_ranges_period: ["1m"]

  compactor:
    data_dir: /data/compactor
    block_ranges: ["1m", "3m", "6m", "30m", "1h", "2h", "6h", "12h"]
    compaction_interval: 1m
    first_level_compaction_wait_period: 1m

Then I added a volume with a size limit of 10MB to the two containers where the compactor runs, mounted at the compactor data directory /data/compactor:

"services":
  "mimir-backend-1":
    "volumes":
...
      - "compactor_data:/data/compactor"
  "mimir-backend-2":
    "volumes":
...
      - "compactor_data:/data/compactor"

volumes:
  compactor_data:
    driver: local
    driver_opts:
      o: "size=10m" 
      device: tmpfs
      type: tmpfs

After a few minutes the compactor containers started logging that they were out of disk:

% docker ps 2>&1 | grep backend             
142fa33bc20c   mimir                                     "./mimir -config.fil…"   8 minutes ago   Up 8 minutes   0.0.0.0:8007->8080/tcp             mimir-read-write-mode-mimir-backend-2-1
e45a138d3501   mimir                                     "./mimir -config.fil…"   8 minutes ago   Up 8 minutes   0.0.0.0:8006->8080/tcp             mimir-read-write-mode-mimir-backend-1-1
% docker logs 142fa33bc20c 2>&1 | grep space | tail -n 1
ts=2024-06-03T10:15:22.187259132Z caller=bucket_compactor.go:257 level=error component=compactor user=anonymous groupKey=0@17241709254077376921-merge--1717409400000-1717409460000 minTime="2024-06-03 10:10:00.662 +0000 UTC" maxTime="2024-06-03 10:11:00 +0000 UTC" msg="compaction job failed" duration=95.135333ms duration_ms=95 err="compact blocks 01HZESC4ET404ADFQCBRBHA1HQ,01HZESBVHYHJ8JHP2K4YWAQ4ZQ,01HZESBKZMETM96S0QKC35SGRN: preallocate: no space left on device"
% docker logs e45a138d3501 2>&1 | grep space | tail -n 1
ts=2024-06-03T10:15:19.613040714Z caller=bucket_compactor.go:257 level=error component=compactor user=anonymous groupKey=0@17241709254077376921-merge--1717409220000-1717409280000 minTime="2024-06-03 10:07:10.657 +0000 UTC" maxTime="2024-06-03 10:08:00 +0000 UTC" msg="compaction job failed" duration=75.618834ms duration_ms=75 err="compact blocks 01HZES646KQ2C9AEAZ8AMGTQ9G,01HZES6BWDDT3QKMDKHF30AV3B,01HZES6MNR410J9NZSGPWVZ8Z9: preallocate: no space left on device"

As expected, once the compactors started logging that they were out of disk, the new metric cortex_compactor_out_of_space_errors_total also started increasing:

[screenshot: graph showing cortex_compactor_out_of_space_errors_total increasing]
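
Given the motivation above (alerting with higher criticality on out-of-space conditions than on generic compaction failures), an operator could hang a dedicated alert off this metric. The rule below is only an illustrative sketch, not part of this PR or of Mimir's shipped alerting mixin; the alert name, grouping labels, window, and threshold are all assumptions:

groups:
  - name: compactor_disk_space
    rules:
      - alert: CompactorOutOfDiskSpace
        # Fire if the compactor reported any out-of-space error in the last 15 minutes.
        expr: sum by (cluster, namespace) (increase(cortex_compactor_out_of_space_errors_total[15m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          message: "The compactor has run out of disk space and likely needs operator intervention."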

replay and others added 2 commits June 3, 2024 12:20
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
@replay replay merged commit 521be6a into main Jun 3, 2024
29 checks passed
@replay replay deleted the add-metrics-to-count-out-of-disk-errors branch June 3, 2024 11:47
narqo pushed a commit to narqo/grafana-mimir that referenced this pull request Jun 6, 2024
* add metric for out-of-space errors

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* syntax

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* better comment

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* PR feedback

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* add CHANGELOG entry

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

---------

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>