
add metric to track out-of-space errors #8237

Merged
merged 5 commits into main from add-metrics-to-count-out-of-disk-errors on Jun 3, 2024

Conversation

@replay replay (Contributor) commented May 31, 2024

This adds a separate metric to count how many times a compaction failed due to an out of space error.

We need to be able to alert on out-of-space conditions with higher criticality than on generic compaction failures. Generic compaction failures can be caused by transient issues such as network lag, and we don't want to be alerted on every network blip. An out-of-space condition, by contrast, is usually not transient and typically requires an operator to resolve, so it makes sense to separate that metric from other compaction failures.

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
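
For illustration, here is a minimal Go sketch of how such a counter could be wired up. It is not the PR's actual code: the metrics struct, the existing failure counter name, and the trackCompactionFailure helper are assumptions for the sketch, while cortex_compactor_out_of_space_errors_total and the errors.Is(err, syscall.ENOSPC) check do appear in this PR.

// Minimal sketch (not the actual PR code): a dedicated counter for out-of-space
// errors next to the generic compaction failure counter.
package compactor

import (
	"errors"
	"syscall"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

type compactionMetrics struct {
	failed     prometheus.Counter
	outOfSpace prometheus.Counter
}

func newCompactionMetrics(reg prometheus.Registerer) *compactionMetrics {
	return &compactionMetrics{
		failed: promauto.With(reg).NewCounter(prometheus.CounterOpts{
			Name: "cortex_compactor_runs_failed_total", // pre-existing failure counter; exact name assumed here
			Help: "Total number of compaction runs that failed.",
		}),
		outOfSpace: promauto.With(reg).NewCounter(prometheus.CounterOpts{
			Name: "cortex_compactor_out_of_space_errors_total", // the metric added by this PR
			Help: "Total number of times the compactor failed because the disk was out of space.",
		}),
	}
}

// trackCompactionFailure increments the out-of-space counter in addition to the
// generic failure counter when the error wraps ENOSPC.
func (m *compactionMetrics) trackCompactionFailure(err error) {
	if err == nil {
		return
	}
	if errors.Is(err, syscall.ENOSPC) {
		m.outOfSpace.Inc()
	}
	m.failed.Inc()
}

Whether an out-of-space error should also count toward the generic failure counter is a design choice; the sketch counts it in both so that existing failure-based alerts keep firing.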
@replay replay marked this pull request as ready for review May 31, 2024 12:48
@replay replay requested a review from a team as a code owner May 31, 2024 12:48
@pracucci pracucci (Collaborator) left a comment


LGTM (assuming you've tested the error is actually captured when disk is out of space... which may not be super difficult to test in the local dev env using a very small volume for the compactor). Also remember to add a CHANGELOG entry.

@@ -665,6 +674,9 @@ func (c *MultitenantCompactor) compactUsers(ctx context.Context) {
// We don't want to count shutdowns as failed compactions because we will pick up with the rest of the compaction after the restart.
level.Info(c.logger).Log("msg", "compaction for user was interrupted by a shutdown", "user", userID)
return
case errors.Is(err, syscall.ENOSPC):
A Collaborator commented on this line:
Have you done any manual test to ensure the error is correctly captured, given it's something we can't easily / reliably assert with a unit test?


pkg/compactor/compactor.go (outdated review thread, resolved)
@replay replay (Contributor Author) commented Jun 3, 2024

I'll document my manual test here:

I ran a Mimir cluster in read-write mode, with the following configuration changed from the defaults to make it easier to reproduce the out-of-space condition:

  blocks_storage:
    tsdb:
      block_ranges_period: ["1m"]

  compactor:
    data_dir: /data/compactor
    block_ranges: ["1m", "3m", "6m", "30m", "1h", "2h", "6h", "12h"]
    compaction_interval: 1m
    first_level_compaction_wait_period: 1m

Then I added a volume with a size limit of 10MB to the two containers where the compactor runs, mounted at the compactor data directory /data/compactor:

"services":
  "mimir-backend-1":
    "volumes":
...
      - "compactor_data:/data/compactor"
  "mimir-backend-2":
    "volumes":
...
      - "compactor_data:/data/compactor"

volumes:
  compactor_data:
    driver: local
    driver_opts:
      o: "size=10m" 
      device: tmpfs
      type: tmpfs

After a few minutes the compactor containers started logging that they were out of disk:

% docker ps 2>&1 | grep backend             
142fa33bc20c   mimir                                     "./mimir -config.fil…"   8 minutes ago   Up 8 minutes   0.0.0.0:8007->8080/tcp             mimir-read-write-mode-mimir-backend-2-1
e45a138d3501   mimir                                     "./mimir -config.fil…"   8 minutes ago   Up 8 minutes   0.0.0.0:8006->8080/tcp             mimir-read-write-mode-mimir-backend-1-1
% docker logs 142fa33bc20c 2>&1 | grep space | tail -n 1
ts=2024-06-03T10:15:22.187259132Z caller=bucket_compactor.go:257 level=error component=compactor user=anonymous groupKey=0@17241709254077376921-merge--1717409400000-1717409460000 minTime="2024-06-03 10:10:00.662 +0000 UTC" maxTime="2024-06-03 10:11:00 +0000 UTC" msg="compaction job failed" duration=95.135333ms duration_ms=95 err="compact blocks 01HZESC4ET404ADFQCBRBHA1HQ,01HZESBVHYHJ8JHP2K4YWAQ4ZQ,01HZESBKZMETM96S0QKC35SGRN: preallocate: no space left on device"
% docker logs e45a138d3501 2>&1 | grep space | tail -n 1
ts=2024-06-03T10:15:19.613040714Z caller=bucket_compactor.go:257 level=error component=compactor user=anonymous groupKey=0@17241709254077376921-merge--1717409220000-1717409280000 minTime="2024-06-03 10:07:10.657 +0000 UTC" maxTime="2024-06-03 10:08:00 +0000 UTC" msg="compaction job failed" duration=75.618834ms duration_ms=75 err="compact blocks 01HZES646KQ2C9AEAZ8AMGTQ9G,01HZES6BWDDT3QKMDKHF30AV3B,01HZES6MNR410J9NZSGPWVZ8Z9: preallocate: no space left on device"

As expected, once the compactors started logging that they were out of disk, the new metric cortex_compactor_out_of_space_errors_total also started increasing:

[screenshot: graph showing cortex_compactor_out_of_space_errors_total increasing]
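
Given the motivation above (alerting with higher criticality on out-of-space conditions than on generic compaction failures), an operator could hang a dedicated alert off this metric. The rule below is only an illustrative sketch, not part of this PR or of Mimir's shipped alerting mixin; the alert name, grouping labels, window, and threshold are all assumptions:

groups:
  - name: compactor_disk_space
    rules:
      - alert: CompactorOutOfDiskSpace
        # Fire if the compactor reported any out-of-space error in the last 15 minutes.
        expr: sum by (cluster, namespace) (increase(cortex_compactor_out_of_space_errors_total[15m])) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          message: "The compactor has run out of disk space and likely needs operator intervention."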

replay and others added 2 commits June 3, 2024 12:20
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
@replay replay merged commit 521be6a into main Jun 3, 2024
29 checks passed
@replay replay deleted the add-metrics-to-count-out-of-disk-errors branch June 3, 2024 11:47
narqo pushed a commit to narqo/grafana-mimir that referenced this pull request Jun 6, 2024
* add metric for out-of-space errors

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* syntax

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* better comment

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

* PR feedback

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* add CHANGELOG entry

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>

---------

Signed-off-by: Mauro Stettler <mauro.stettler@gmail.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>