Runbooks.
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
pstibrany committed Mar 25, 2024
1 parent 64ca93f commit 0d3eab3
Showing 4 changed files with 53 additions and 7 deletions.
48 changes: 47 additions & 1 deletion docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -1360,7 +1360,7 @@ How to **investigate**:

### MimirStartingIngesterKafkaReceiveDelayIncreasing

This alert fires when the consumption lag reported by an ingester during its "starting" phase is not decreasing.
This alert fires when the "receive delay" reported by an ingester during its "starting" phase is not decreasing.

How it **works**:

@@ -1373,6 +1373,52 @@ How to **investigate**:

- Check if the ingester is fast enough to process all data in Kafka; see the query sketch below. If it is not, configure ingesters to start consuming from a later offset instead.
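
A minimal query sketch for this check, assuming the receive delay is exposed as `_sum`/`_count` counters named `cortex_ingest_storage_reader_receive_delay_seconds_*`; this metric name is an assumption, not taken from this commit, so verify it against the metrics your ingesters actually expose:

```
# Recent average receive delay per ingester (assumed metric name, see note above).
# While the ingester is in its "starting" phase, this value should trend down;
# a flat or growing value means the ingester is not catching up with Kafka.
sum by (pod) (rate(cortex_ingest_storage_reader_receive_delay_seconds_sum[5m]))
  /
sum by (pod) (rate(cortex_ingest_storage_reader_receive_delay_seconds_count[5m]))
```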

### MimirRunningIngesterReceiveDelayTooHigh

This alert fires when the "receive delay" reported by an ingester in "running" mode reaches the alert threshold.

How it **works**:

- After an ingester starts and catches up with records in Kafka, it switches to "running" mode.
- In "running" mode, the ingester continues to process incoming samples from Kafka and keeps reporting the "receive delay". See the [`MimirStartingIngesterKafkaReceiveDelayIncreasing`](#MimirStartingIngesterKafkaReceiveDelayIncreasing) runbook for details about this metric.
- Under normal conditions, when the ingester is running and processing records faster than new records are produced, the receive delay should be stable.
- If the observed "receive delay" increases and reaches a certain threshold, the alert is raised.

How to **investigate**:

- Check if the ingester is fast enough to process all data in Kafka; see the query sketch after this list.
- If ingesters are too slow, consider scaling them either vertically (to make each ingester faster) or horizontally (to spread incoming series across more ingesters).
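
A sketch to tell whether the receive delay keeps growing while the ingester is running, using the same assumed `cortex_ingest_storage_reader_receive_delay_seconds_*` metric name as above; adjust the metric name and grouping labels to your deployment:

```
# Change in average receive delay per ingester over the last hour
# (assumed metric name, see note above). Persistently positive values mean
# the delay keeps growing and ingesters likely need to be scaled.
(
  sum by (pod) (rate(cortex_ingest_storage_reader_receive_delay_seconds_sum[5m]))
    /
  sum by (pod) (rate(cortex_ingest_storage_reader_receive_delay_seconds_count[5m]))
)
-
(
  sum by (pod) (rate(cortex_ingest_storage_reader_receive_delay_seconds_sum[5m] offset 1h))
    /
  sum by (pod) (rate(cortex_ingest_storage_reader_receive_delay_seconds_count[5m] offset 1h))
)
```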

### MimirIngesterFailsToProcessRecordsFromKafka

This alert fires when an ingester is unable to process incoming records from Kafka due to internal errors. If ingest-storage weren't used, such failing push requests would surface as 5xx errors to clients.

How it **works**:

- The ingester reads records from Kafka and processes them locally. Processing means unmarshalling the data and handling the write requests stored in the records.
- Write requests can fail due to "user" or "server" errors. A typical user error is a too-low limit on the number of series. A server error can be, for example, the ingester hitting an instance limit.
- If requests keep failing due to server errors, this alert is raised.

How to **investigate**:

- Check the ingester logs to see why requests are failing, and troubleshoot based on that. The query sketch below can help narrow down which ingesters and error causes are involved.
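
A query sketch for narrowing down the failures; the metric and its `cause` label are taken from the alert expression in this commit, while the grouping labels (`pod` versus `instance`, plus `cluster` and `namespace`) depend on your deployment:

```
# Per-ingester rate of records that failed to be processed, split by cause.
# Failures with cause="server" are the ones that trigger this alert;
# failures with other cause values correspond to user errors and are not counted.
sum by (pod, cause) (rate(cortex_ingest_storage_reader_records_failed_total[5m]))
```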

### MimirIngesterFailsEnforceStrongConsistencyOnReadPath

This alert fires when too many read requests that require strong consistency are failing.

How it **works**:

- When a read request asks for the strong-consistency guarantee, the ingester reads the last produced offset from Kafka and waits until the record with this offset has been consumed.
- If the read request times out during this wait, it is counted as a strong-consistency failure.
- If requests keep failing because strong consistency cannot be enforced, this alert is raised.

How to **investigate**:

- Check the wait latency of requests with strong consistency.
- Check whether the ingester has to process too many records, and whether ingesters need to be scaled up (vertically or horizontally).
- Consider increasing the read timeout of the requests. The failure-rate query below can help confirm which ingesters are affected.
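
A sketch of the failure rate per ingester, using the metric from the alert expression in this commit; if your ingesters also expose a total counter of strong-consistency requests, dividing by it yields a failure ratio, but that counter name is not confirmed here:

```
# Per-ingester rate of failed strong-consistency enforcements
# (metric name taken from the alert expression in this commit).
sum by (pod) (rate(cortex_ingest_storage_strong_consistency_failures_total[5m]))
```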

## Errors catalog

Mimir has some codified error IDs that you might see in HTTP responses or logs.
4 changes: 2 additions & 2 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -1005,7 +1005,7 @@ groups:
}} fails to consume write requests read from Kafka due to internal errors.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterfailstoprocessrecordsfromkafka
expr: |
sum by (cluster, namespace, instance) (rate(cortex_ingest_storage_reader_records_failed_total{cause="server"}[5m])) > 0
sum by (cluster, namespace, instance) (rate(cortex_ingest_storage_reader_records_failed_total{cause="server"}[1m])) > 0
for: 5m
labels:
severity: critical
@@ -1015,7 +1015,7 @@ groups:
}} fails to enforce strong-consistency on read-path.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterfailsenforcestrongconsistencyonreadpath
expr: |
sum by (cluster, namespace, instance) (rate(cortex_ingest_storage_strong_consistency_failures_total[5m])) > 0
sum by (cluster, namespace, instance) (rate(cortex_ingest_storage_strong_consistency_failures_total[1m])) > 0
for: 5m
labels:
severity: critical
4 changes: 2 additions & 2 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -1018,7 +1018,7 @@ groups:
}} fails to consume write requests read from Kafka due to internal errors.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterfailstoprocessrecordsfromkafka
expr: |
sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_reader_records_failed_total{cause="server"}[5m])) > 0
sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_reader_records_failed_total{cause="server"}[1m])) > 0
for: 5m
labels:
severity: critical
@@ -1028,7 +1028,7 @@ groups:
}} fails to enforce strong-consistency on read-path.
runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimiringesterfailsenforcestrongconsistencyonreadpath
expr: |
sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_strong_consistency_failures_total[5m])) > 0
sum by (cluster, namespace, pod) (rate(cortex_ingest_storage_strong_consistency_failures_total[1m])) > 0
for: 5m
labels:
severity: critical
4 changes: 2 additions & 2 deletions operations/mimir-mixin/alerts/ingest-storage.libsonnet
@@ -90,7 +90,7 @@
alert: $.alertName('IngesterFailsToProcessRecordsFromKafka'),
'for': '5m',
expr: |||
sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_records_failed_total{cause="server"}[5m])) > 0
sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_records_failed_total{cause="server"}[1m])) > 0
||| % $._config,
labels: {
severity: 'critical',
@@ -104,7 +104,7 @@
alert: $.alertName('IngesterFailsEnforceStrongConsistencyOnReadPath'),
'for': '5m',
expr: |||
sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_strong_consistency_failures_total[5m])) > 0
sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_strong_consistency_failures_total[1m])) > 0
||| % $._config,
labels: {
severity: 'critical',
