
Ingest-storage alerts #7702

Merged: 19 commits into main on Mar 27, 2024
Conversation

@pstibrany (Member) commented Mar 22, 2024

What this PR does

This PR adds new alerts that are useful when using ingest-storage:

  • Alert when an ingester is having trouble reading from Kafka
  • Alert when ingester lag is above the threshold (in the “running” phase) or the lag is increasing (applies to both the starting and running phases)
  • Alert when an ingester is unable to process records and fails with errors (i.e. the ingester runs into “5xx” errors)
  • Alert when we fail to enforce strong consistency

This PR also makes ingesters ignore context cancellation as an error when reading records from Kafka. This "error" is expected when ingesters are shutting down.
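
For a sense of the shape of these alerts, here is a minimal sketch of how the Kafka read-errors alert could be expressed in the mixin. The alert name, threshold, and for duration below are placeholders (the metric name and the node_id aggregation are taken from the diff discussed further down); the actual definitions are in the mixin changes of this PR.

// Minimal sketch only: alert name, 'for' duration and threshold are placeholders,
// not the exact definitions added by this PR.
{
  alert: $.alertName('IngesterFailsToReadFromKafka'),
  'for': '5m',
  expr: |||
    sum by (%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (
      rate(cortex_ingest_storage_reader_read_errors_total[1m])
    )
    > 0
  ||| % $._config,
  labels: {
    severity: 'critical',
  },
}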

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

…RateTooHigh alerts.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@pstibrany pstibrany marked this pull request as ready for review March 25, 2024 12:11
@pstibrany pstibrany requested review from jdbaldry and a team as code owners March 25, 2024 12:11
@pstibrany pstibrany changed the title from "WIP: Ingest-storage alerts" to "Ingest-storage alerts" Mar 25, 2024
@pracucci pracucci self-requested a review March 26, 2024 08:03
@pracucci (Collaborator) left a comment

Very nice job! I left a few minor comments on the runbooks.

docs/sources/mimir/manage/mimir-runbooks/_index.md: 9 review comments (outdated, resolved)
alert: $.alertName('StartingIngesterKafkaReceiveDelayIncreasing'),
'for': '5m',
expr: |||
deriv(histogram_quantile(0.99, sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))[5m:1m]) > 0

@pracucci (Collaborator) commented:

Thinking out loud: I'm wondering whether it's better to use the average or the 99th percentile here. Maybe there's no difference. Since we expect to consume records in order, the average (measured over the last 1m) should be good as well. The main differences that come to mind:

  • The 99th percentile takes longer to decrease while catching up, but also takes longer to increase when lag is growing
  • The average doesn't suffer from the imprecise measurement of the 99th percentile when a classic histogram is used

I have no strong opinion, so I leave my thoughts to you. The same applies to MimirRunningIngesterReceiveDelayTooHigh.

@pstibrany (Member, Author) replied:

> The average doesn't suffer from the imprecise measurement of the 99th percentile when a classic histogram is used

Note that we explicitly use a native histogram in the query here.

@pstibrany (Member, Author) replied:

Switched to using the average (but still querying the native histogram series).
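
For illustration, an average computed while still querying the native histogram could look roughly like the sketch below, using the histogram_sum and histogram_count functions; this is a hedged sketch, not necessarily the exact expression from the intermediate commit.

expr: |||
  deriv((
      histogram_sum(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))
      /
      histogram_count(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))
  )[5m:1m]) > 0
||| % $._config,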

@pstibrany (Member, Author) replied:

> but still querying the native histogram series

Had to switch to computing the average from classic histograms instead, due to monitoring-mixins/mixtool#163.
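
For comparison, the average computed from classic histogram series (the _sum and _count series) avoids native-histogram functions entirely. This is a sketch assuming the metric is also exposed as a classic histogram, not necessarily the exact merged expression.

expr: |||
  deriv((
      sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds_sum{phase="starting"}[1m]))
      /
      sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds_count{phase="starting"}[1m]))
  )[5m:1m]) > 0
||| % $._config,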

pstibrany and others added 5 commits March 26, 2024 16:06
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…asing and RunningIngesterReceiveDelayTooHigh alerts.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…layIncreasing and RunningIngesterReceiveDelayTooHigh alerts, because mixtool can't handle native histogram functions.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@pstibrany pstibrany enabled auto-merge (squash) March 27, 2024 08:59
@pstibrany pstibrany merged commit 68cf90d into main Mar 27, 2024
31 checks passed
@pstibrany pstibrany deleted the ingest-storage-alerts branch March 27, 2024 09:04
// We use node_id to only alert if problems with the same Kafka node are repeating.
// If problems are for different nodes (e.g. during a rollout), that is not an issue, and we don't need to trigger the alert.
expr: |||
sum by(%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (rate(cortex_ingest_storage_reader_read_errors_total[1m]))

@dimitarvdimitrov (Contributor) commented Mar 27, 2024:

The [1m] won't work if the metrics are scraped with 1m intervals. Juraj added some utils to make this configurable. See this example:

expr: |||
increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[%s]) > 0
||| % $.alertRangeInterval(1),

Can you change the intervals of the alerts to use $.alertRangeInterval(1) for 1m, $.alertRangeInterval(5) for 5m, etc.?
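
For context, here is a hypothetical sketch of how such a helper could work (not necessarily how the Mimir mixin actually defines $.alertRangeInterval): it scales a configurable base interval so that range selectors stay wide enough for longer scrape intervals. The config field name below is invented.

{
  _config+:: {
    // Hypothetical config field name; the real mixin option may differ.
    base_alerts_range_interval_minutes: 1,
  },

  // alertRangeInterval(5) returns '5m' with the default base interval of 1 minute,
  // or '10m' if the base interval is raised to 2 minutes.
  alertRangeInterval(multiple)::
    '%dm' % ($._config.base_alerts_range_interval_minutes * multiple),
}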

@pstibrany (Member, Author) replied:

I really don't want to use a longer interval here, as I expect this is typically a problem that disappears quickly. I've also opted for a quite short for: 5m here, and if we used a longer range window, that wouldn't work.

Let's use this first and learn whether it works. I expect we will adjust these alerts a few times before considering them "ready for everyone".

Contributor replied:

Sure, as long as we don't forget 👍

Contributor commented:

So I updated our mixin today, finally wanting to enable the validation in our repo that no queries have too short a range interval, and I ran into this :( I currently point the mixin at the top of main. If these alerts are not yet recommended for general consumption, what would be the recommendation for the mixins: to point at tags instead of the top of main? Or could we add an option to exclude the alerts which are not yet ready?

@pstibrany (Member, Author) replied:

I'm happy to add an option to exclude these alerts. They are for a new Kafka-based ingestion mode we're working on, and it's far from ready. We added the alerts because we plan to start testing it internally.

Would you send a PR with the option?
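
A hypothetical sketch of what such an opt-out could look like in the mixin config (the flag and field names here are invented; the option that was actually added landed later in #7867 and may differ):

{
  _config+:: {
    // Hypothetical flag; see #7867 for the option that was actually added.
    ingest_storage_alerts_enabled: false,
  },

  // Only emit the ingest-storage alert group when the flag is enabled.
  // 'ingestStorageAlerts' is a placeholder for the alert group added by this PR.
  prometheusAlerts+:: {
    groups+:
      if $._config.ingest_storage_alerts_enabled
      then $.ingestStorageAlerts
      else [],
  },
}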

Contributor replied:

Will do, thank you.

Contributor replied:

Finally got to it, sorry for the delay. I raised #7867 to add the option to exclude the alerts.

> 0.1
||| % $._config,
labels: {
severity: 'critical',

Contributor commented:

Wouldn't we get alerted for increased lag anyway if there are errors? 10% errors might be tolerable 🤷 I'm only trying to reduce the number of alerts so we don't get alert fatigue from the start.

@pstibrany (Member, Author) replied:

Here the condition must be true for 15 minutes. I don't think 10% errors during that time is tolerable -- normally we expect 0 errors.

@pstibrany (Member, Author) added:

> Wouldn't we get alerted for increased lag if there are errors?

If the client is able to read further records, then this wouldn't be true.

Contributor replied:

Isn't this alerting on causes rather than symptoms? Errors themselves are causes of outages, but they aren't symptoms of an outage. Increased delay or failing consistency checks are the symptoms.

We might not have to immediately address the errors if all the customer-visible metrics look OK (like delay and strong-consistency guarantees).

I don't want to delve too deep here, but I suspect this alert will trigger along with some other alert and won't bring much value on its own. If you think we should keep it, then I can also disagree and commit 😄

@pstibrany (Member, Author) replied:

Honestly, I don't know yet. I guess the first few incidents will tell us more.
