From e07c62ff4b8e0675add20e4590ed7a2fb99740c0 Mon Sep 17 00:00:00 2001
From: Dimitar Dimitrov
Date: Thu, 17 Mar 2022 12:31:17 +0100
Subject: [PATCH 1/4] Add OoO instructions for MimirCompactorHasNotSuccessfullyRunCompaction

---
 operations/mimir-mixin/docs/playbooks.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/operations/mimir-mixin/docs/playbooks.md b/operations/mimir-mixin/docs/playbooks.md
index 77a8d2fac02..80d734fb86b 100644
--- a/operations/mimir-mixin/docs/playbooks.md
+++ b/operations/mimir-mixin/docs/playbooks.md
@@ -455,6 +455,18 @@ How to **investigate**:

 - Look for any error in the compactor logs
   - Corruption: [`not healthy index found`](#compactor-is-failing-because-of-not-healthy-index-found)
+  - Invalid result block:
+    - **How to detect**: Search compactor logs for `invalid result block`: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22,%20namespace%3D%5C%22%3Cnamespace%3E%5C%22%7D%20%7C%3D%20%5C%22invalid%20result%20block%5C%22%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace `` with the namespace from the alert)
+    - **What it means**: The compactor successfully validated the source blocks. But the validation on the result block after the compaction did not succeed. The result block was not uploaded and the compaction job will be retried.
+    - Out-of-order chunks
+      - Search compactor logs for `out-of-order chunks`: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22,%20namespace%3D%5C%22%3Cnamespace%3E%5C%22%7D%20%7C%3D%20%5C%22invalid%20result%20block%5C%22%20%7C%3D%20%5C%22out-of-order%20chunks%5C%22%20%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace ``)
+      - This is caused by [a bug in the ingester](https://github.com/grafana/mimir-squad/issues/453#issuecomment-1015193060). Ingesters upload blocks where the MinT and MaxT of some chunks don't match the first and last samples in the chunk. When the faulty chunks' MinT and MaxT overlap with other chunks, the compactor merges the chunks. Because one chunks's MinT and MaxT are incorrect the merge may be performed incorrectly, leading to OoO samples.
+      - **How to mitigate**: Mark the faulty blocks to avoid compaction:
+        - Find all affected compaction jobs: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22,%20namespace%3D%5C%22%3Cnamespace%3E%5C%22%7D%20%7C%3D%20%5C%22invalid%20result%20block%5C%22%20%7C%3D%20%5C%22out-of-order%20chunks%5C%22%20%7C%20pattern%20%60%3C_%3E%20invalid%20result%20block%20%2Fdata%2Fcompact%2F%3Ccompaction_group%3E%2F%3Cresult_block%3E:%20%3C_%3E%60%20%7C%20line_format%20%60%7B%7B.compaction_group%7D%7D%20%7B%7B.result_block%7D%7D%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace ``)
+        - For each failed compaction job
+          - Pick one result block (doesn't matter which)
+          - Find source blocks for the compaction job: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22%7D%20%7C%3D%20%6001FYBQDRBD3DN2BE5Q3CFY5G0W%60%20%7C%3D%20%5C%22compact%20blocks%5C%22%20%7C%20logfmt%20%7C%20line_format%20%60%7B%7B.sources%7D%7D%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace ``)
+          - Follow instructions in the description of https://github.com/grafana/mimir-squad/issues/453 to upload `no-compact-mark.json` files to the faulty source blocks.

 ### MimirCompactorSkippedBlocksWithOutOfOrderChunks

From 911438f6fc60ef061feeb53bc0d5960170a736e4 Mon Sep 17 00:00:00 2001
From: Dimitar Dimitrov
Date: Fri, 18 Mar 2022 12:14:16 +0100
Subject: [PATCH 2/4] Strip out private links

Signed-off-by: Dimitar Dimitrov
---
 operations/mimir-mixin/docs/playbooks.md | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/operations/mimir-mixin/docs/playbooks.md b/operations/mimir-mixin/docs/playbooks.md
index 80d734fb86b..e64c4f3cb51 100644
--- a/operations/mimir-mixin/docs/playbooks.md
+++ b/operations/mimir-mixin/docs/playbooks.md
@@ -456,17 +456,20 @@ How to **investigate**:
 - Look for any error in the compactor logs
   - Corruption: [`not healthy index found`](#compactor-is-failing-because-of-not-healthy-index-found)
   - Invalid result block:
-    - **How to detect**: Search compactor logs for `invalid result block`: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22,%20namespace%3D%5C%22%3Cnamespace%3E%5C%22%7D%20%7C%3D%20%5C%22invalid%20result%20block%5C%22%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace `` with the namespace from the alert)
+    - **How to detect**: Search compactor logs for `invalid result block`.
     - **What it means**: The compactor successfully validated the source blocks. But the validation on the result block after the compaction did not succeed. The result block was not uploaded and the compaction job will be retried.
     - Out-of-order chunks
-      - Search compactor logs for `out-of-order chunks`: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22,%20namespace%3D%5C%22%3Cnamespace%3E%5C%22%7D%20%7C%3D%20%5C%22invalid%20result%20block%5C%22%20%7C%3D%20%5C%22out-of-order%20chunks%5C%22%20%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace ``)
-      - This is caused by [a bug in the ingester](https://github.com/grafana/mimir-squad/issues/453#issuecomment-1015193060). Ingesters upload blocks where the MinT and MaxT of some chunks don't match the first and last samples in the chunk. When the faulty chunks' MinT and MaxT overlap with other chunks, the compactor merges the chunks. Because one chunks's MinT and MaxT are incorrect the merge may be performed incorrectly, leading to OoO samples.
-      - **How to mitigate**: Mark the faulty blocks to avoid compaction:
-        - Find all affected compaction jobs: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22,%20namespace%3D%5C%22%3Cnamespace%3E%5C%22%7D%20%7C%3D%20%5C%22invalid%20result%20block%5C%22%20%7C%3D%20%5C%22out-of-order%20chunks%5C%22%20%7C%20pattern%20%60%3C_%3E%20invalid%20result%20block%20%2Fdata%2Fcompact%2F%3Ccompaction_group%3E%2F%3Cresult_block%3E:%20%3C_%3E%60%20%7C%20line_format%20%60%7B%7B.compaction_group%7D%7D%20%7B%7B.result_block%7D%7D%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace ``)
+      - **How to detect**: Search compactor logs for `invalid result block` and `out-of-order chunks`.
+      - This is caused by a bug in the ingester. Ingesters upload blocks where the MinT and MaxT of some chunks don't match the first and last samples in the chunk. When the faulty chunks' MinT and MaxT overlap with other chunks, the compactor merges the chunks. Because one chunks's MinT and MaxT are incorrect the merge may be performed incorrectly, leading to OoO samples.
+      - **How to mitigate**: Mark the faulty blocks to avoid compacting them in the future:
+        - Find all affected compaction groups in the compactor logs. You will find them as `invalid result block /data/compact//`.
         - For each failed compaction job
           - Pick one result block (doesn't matter which)
-          - Find source blocks for the compaction job: [explore](https://admin-ops-us-east-0.grafana.net/grafana/explore?orgId=1&left=%7B%22datasource%22:%22loki-ops%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bjob%3D~%5C%22.*%2Fcompactor%5C%22%7D%20%7C%3D%20%6001FYBQDRBD3DN2BE5Q3CFY5G0W%60%20%7C%3D%20%5C%22compact%20blocks%5C%22%20%7C%20logfmt%20%7C%20line_format%20%60%7B%7B.sources%7D%7D%60%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D) (replace ``)
-          - Follow instructions in the description of https://github.com/grafana/mimir-squad/issues/453 to upload `no-compact-mark.json` files to the faulty source blocks.
+          - Find source blocks for the compaction job: search for `msg="compact blocks"` and a mention of the result block ID.
+          - Upload a JSON file to the markers directory of the compactor: `/markers/-no-compact-mark.json`. The format of the file follows. Replace the `id` and `no_compact_time`:
+            ```json
+            {"id":"01FYAFBE9F0VH6555R3J1CFPHP","version":1,"details":"When compacting with other blocks is leading to out-of-order chunks","no_compact_time":1647514725,"reason":"manual"}
+            ```

 ### MimirCompactorSkippedBlocksWithOutOfOrderChunks

From cecd5f4c1e446848cdd1efcc02f3306f18a58450 Mon Sep 17 00:00:00 2001
From: Dimitar Dimitrov
Date: Fri, 18 Mar 2022 13:06:39 +0100
Subject: [PATCH 3/4] Address PR comments

Signed-off-by: Dimitar Dimitrov
---
 operations/mimir-mixin/docs/playbooks.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/operations/mimir-mixin/docs/playbooks.md b/operations/mimir-mixin/docs/playbooks.md
index e64c4f3cb51..efd247db7a1 100644
--- a/operations/mimir-mixin/docs/playbooks.md
+++ b/operations/mimir-mixin/docs/playbooks.md
@@ -457,10 +457,10 @@ How to **investigate**:
   - Corruption: [`not healthy index found`](#compactor-is-failing-because-of-not-healthy-index-found)
   - Invalid result block:
     - **How to detect**: Search compactor logs for `invalid result block`.
-    - **What it means**: The compactor successfully validated the source blocks. But the validation on the result block after the compaction did not succeed. The result block was not uploaded and the compaction job will be retried.
+    - **What it means**: The compactor successfully validated the source blocks. But the validation of the result block after the compaction did not succeed. The result block was not uploaded and the compaction job will be retried.
     - Out-of-order chunks
       - **How to detect**: Search compactor logs for `invalid result block` and `out-of-order chunks`.
-      - This is caused by a bug in the ingester. Ingesters upload blocks where the MinT and MaxT of some chunks don't match the first and last samples in the chunk. When the faulty chunks' MinT and MaxT overlap with other chunks, the compactor merges the chunks. Because one chunks's MinT and MaxT are incorrect the merge may be performed incorrectly, leading to OoO samples.
+      - This is caused by a bug in the ingester. Ingesters upload blocks where the MinT and MaxT of some chunks don't match the first and last samples in the chunk. When the faulty chunks' MinT and MaxT overlap with other chunks, the compactor merges the chunks. Because one chunk's MinT and MaxT are incorrect the merge may be performed incorrectly, leading to OoO samples.
       - **How to mitigate**: Mark the faulty blocks to avoid compacting them in the future:
         - Find all affected compaction groups in the compactor logs. You will find them as `invalid result block /data/compact//`.
         - For each failed compaction job

From 41fa71a3b5b11a0f8d2530e215f2cc540844860a Mon Sep 17 00:00:00 2001
From: Dimitar Dimitrov
Date: Fri, 18 Mar 2022 13:18:26 +0100
Subject: [PATCH 4/4] Fix JSON formatting

Signed-off-by: Dimitar Dimitrov
---
 operations/mimir-mixin/docs/playbooks.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/operations/mimir-mixin/docs/playbooks.md b/operations/mimir-mixin/docs/playbooks.md
index efd247db7a1..c9efb52274d 100644
--- a/operations/mimir-mixin/docs/playbooks.md
+++ b/operations/mimir-mixin/docs/playbooks.md
@@ -468,7 +468,13 @@ How to **investigate**:
           - Find source blocks for the compaction job: search for `msg="compact blocks"` and a mention of the result block ID.
           - Upload a JSON file to the markers directory of the compactor: `/markers/-no-compact-mark.json`. The format of the file follows. Replace the `id` and `no_compact_time`:
             ```json
-            {"id":"01FYAFBE9F0VH6555R3J1CFPHP","version":1,"details":"When compacting with other blocks is leading to out-of-order chunks","no_compact_time":1647514725,"reason":"manual"}
+            {
+              "id": "01FYAFBE9F0VH6555R3J1CFPHP",
+              "version": 1,
+              "details": "When compacting with other blocks is leading to out-of-order chunks",
+              "no_compact_time": 1647514725,
+              "reason": "manual"
+            }
             ```

 ### MimirCompactorSkippedBlocksWithOutOfOrderChunks
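
The mitigation steps added by this series can be sketched in Python. This is a minimal illustration and not part of the patch. The function names (`affected_jobs`, `no_compact_mark`, `marker_file_name`) are hypothetical. The log pattern assumes the `invalid result block /data/compact/<compaction_group>/<result_block>` shape used by the explore query in patch 1, and the marker file name assumes the source block ID is prefixed to `-no-compact-mark.json`. Uploading the file to object storage is left out, since the playbook does not name the tooling.

```python
import json
import re
import time

# Assumed shape of the compactor log fragment quoted in the playbook.
# The group names "group" and "block" are illustrative only.
INVALID_RESULT_RE = re.compile(
    r"invalid result block /data/compact/(?P<group>[^/\s]+)/(?P<block>[^:\s]+)"
)


def affected_jobs(log_lines):
    """Map each compaction group to the sorted result block IDs that failed
    validation with out-of-order chunks."""
    jobs = {}
    for line in log_lines:
        # Only keep "invalid result block" lines that mention out-of-order chunks.
        if "out-of-order chunks" not in line:
            continue
        match = INVALID_RESULT_RE.search(line)
        if match:
            jobs.setdefault(match["group"], set()).add(match["block"])
    return {group: sorted(blocks) for group, blocks in jobs.items()}


def no_compact_mark(block_id, now=None):
    """Build the marker body with the fields shown in the playbook's JSON example."""
    timestamp = int(time.time()) if now is None else int(now)
    return {
        "id": block_id,
        "version": 1,
        "details": "When compacting with other blocks is leading to out-of-order chunks",
        "no_compact_time": timestamp,
        "reason": "manual",
    }


def marker_file_name(block_id):
    """Assumed file name of the marker inside the markers directory."""
    return f"{block_id}-no-compact-mark.json"


if __name__ == "__main__":
    mark = no_compact_mark("01FYAFBE9F0VH6555R3J1CFPHP", now=1647514725)
    print(marker_file_name(mark["id"]))
    print(json.dumps(mark, indent=2))
```

The result blocks returned by `affected_jobs` are only the starting point: the markers are written for the source blocks of each failed job, which still have to be looked up through the `msg="compact blocks"` log lines as the playbook describes.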