
Ingest-storage alerts #7702

Merged: 19 commits into main on Mar 27, 2024
Conversation

@pstibrany (Member) commented Mar 22, 2024

What this PR does

This PR adds new alerts that are useful when using ingest-storage:

  • Alert when an ingester is having trouble reading from Kafka
  • Alert when ingester lag is above the threshold (in the “running” phase) or the lag is increasing (applies to both the starting and running phases)
  • Alert when an ingester is unable to process records and fails with errors (i.e. the ingester runs into “5xx” errors)
  • Alert when we fail to enforce strong consistency

This PR also makes ingesters ignore context cancellation as an error when reading records from Kafka. This "error" is expected when ingesters are shutting down.
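
For a sense of the shape of these alerts, here is a minimal sketch of how the Kafka read-errors alert could be expressed in the mixin. The alert name, threshold, and for duration below are placeholders (the metric name and the node_id aggregation are taken from the diff discussed further down); the actual definitions are in the mixin changes of this PR.

// Minimal sketch only: alert name, 'for' duration and threshold are placeholders,
// not the exact definitions added by this PR.
{
  alert: $.alertName('IngesterFailsToReadFromKafka'),
  'for': '5m',
  expr: |||
    sum by (%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (
      rate(cortex_ingest_storage_reader_read_errors_total[1m])
    )
    > 0
  ||| % $._config,
  labels: {
    severity: 'critical',
  },
}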

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

…RateTooHigh alerts.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@pstibrany pstibrany marked this pull request as ready for review March 25, 2024 12:11
@pstibrany pstibrany requested review from jdbaldry and a team as code owners March 25, 2024 12:11
@pstibrany pstibrany changed the title from "WIP: Ingest-storage alerts" to "Ingest-storage alerts" Mar 25, 2024
@pracucci pracucci self-requested a review March 26, 2024 08:03
@pracucci (Collaborator) left a comment

Very nice job! I left a few minor comments on the runbooks.

docs/sources/mimir/manage/mimir-runbooks/_index.md: 9 review comments (outdated, resolved)
alert: $.alertName('StartingIngesterKafkaReceiveDelayIncreasing'),
'for': '5m',
expr: |||
deriv(histogram_quantile(0.99, sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))[5m:1m]) > 0

@pracucci (Collaborator) commented:

Thinking out loud: I'm wondering whether it's better to use the average or the 99th percentile here. Maybe there's no difference. Since we expect to consume records in order, the average (measured over the last 1m) should be good as well. The main differences that come to mind:

  • The 99th percentile takes longer to decrease while catching up, but also takes longer to increase when lag is growing
  • The average doesn't suffer from the imprecise measurement of the 99th percentile when a classic histogram is used

I have no strong opinion, so I leave my thoughts to you. The same applies to MimirRunningIngesterReceiveDelayTooHigh.

@pstibrany (Member, Author) replied:

> The average doesn't suffer from the imprecise measurement of the 99th percentile when a classic histogram is used

Note that we explicitly use a native histogram in the query here.

@pstibrany (Member, Author) replied:

Switched to using the average (but still querying the native histogram series).
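
For illustration, an average computed while still querying the native histogram could look roughly like the sketch below, using the histogram_sum and histogram_count functions; this is a hedged sketch, not necessarily the exact expression from the intermediate commit.

expr: |||
  deriv((
      histogram_sum(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))
      /
      histogram_count(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))
  )[5m:1m]) > 0
||| % $._config,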

@pstibrany (Member, Author) replied:

> but still querying the native histogram series

Had to switch to computing the average from classic histograms instead, due to monitoring-mixins/mixtool#163.
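
For comparison, the average computed from classic histogram series (the _sum and _count series) avoids native-histogram functions entirely. This is a sketch assuming the metric is also exposed as a classic histogram, not necessarily the exact merged expression.

expr: |||
  deriv((
      sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds_sum{phase="starting"}[1m]))
      /
      sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds_count{phase="starting"}[1m]))
  )[5m:1m]) > 0
||| % $._config,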

pstibrany and others added 5 commits March 26, 2024 16:06
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…asing and RunningIngesterReceiveDelayTooHigh alerts.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…layIncreasing and RunningIngesterReceiveDelayTooHigh alerts, because mixtool can't handle native histogram functions.

Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
@pstibrany pstibrany enabled auto-merge (squash) March 27, 2024 08:59
@pstibrany pstibrany merged commit 68cf90d into main Mar 27, 2024
31 checks passed
@pstibrany pstibrany deleted the ingest-storage-alerts branch March 27, 2024 09:04
// We use node_id to only alert if problems with the same Kafka node are repeating.
// If problems are for different nodes (e.g. during a rollout), that is not an issue, and we don't need to trigger the alert.
expr: |||
sum by(%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (rate(cortex_ingest_storage_reader_read_errors_total[1m]))

@dimitarvdimitrov (Contributor) commented Mar 27, 2024:

The [1m] won't work if the metrics are scraped with 1m intervals. Juraj added some utils to make this configurable. See this example:

expr: |||
increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[%s]) > 0
||| % $.alertRangeInterval(1),

Can you change the intervals of the alerts to use $.alertRangeInterval(1) for 1m, $.alertRangeInterval(5) for 5m, etc.?
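
For context, here is a hypothetical sketch of how such a helper could work (not necessarily how the Mimir mixin actually defines $.alertRangeInterval): it scales a configurable base interval so that range selectors stay wide enough for longer scrape intervals. The config field name below is invented.

{
  _config+:: {
    // Hypothetical config field name; the real mixin option may differ.
    base_alerts_range_interval_minutes: 1,
  },

  // alertRangeInterval(5) returns '5m' with the default base interval of 1 minute,
  // or '10m' if the base interval is raised to 2 minutes.
  alertRangeInterval(multiple)::
    '%dm' % ($._config.base_alerts_range_interval_minutes * multiple),
}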

@pstibrany (Member, Author) replied:

I really don't want to use a longer interval here, as I expect this is typically a problem that disappears quickly. I've also opted for a quite short for: 5m here, and if we used a longer range window, that wouldn't work.

Let's use this first and learn whether it works. I expect we will adjust these alerts a few times before considering them "ready for everyone".

Contributor replied:

Sure, as long as we don't forget 👍

Contributor commented:

So I updated our mixin today, finally wanting to enable the validation in our repo that no queries have too short a range interval, and I ran into this :( I currently point the mixin at the top of main. If these alerts are not yet recommended for general consumption, what would be the recommendation for the mixins: to point at tags instead of the top of main? Or could we add an option to exclude the alerts which are not yet ready?

@pstibrany (Member, Author) replied:

I'm happy to add an option to exclude these alerts. They are for a new Kafka-based ingestion mode we're working on, and it's far from ready. We added the alerts because we plan to start testing it internally.

Would you send a PR with the option?
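
A hypothetical sketch of what such an opt-out could look like in the mixin config (the flag and field names here are invented; the option that was actually added landed later in #7867 and may differ):

{
  _config+:: {
    // Hypothetical flag; see #7867 for the option that was actually added.
    ingest_storage_alerts_enabled: false,
  },

  // Only emit the ingest-storage alert group when the flag is enabled.
  // 'ingestStorageAlerts' is a placeholder for the alert group added by this PR.
  prometheusAlerts+:: {
    groups+:
      if $._config.ingest_storage_alerts_enabled
      then $.ingestStorageAlerts
      else [],
  },
}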

Contributor replied:

Will do, thank you.

Contributor replied:

Finally got to it, sorry for the delay. I raised #7867 to add the option to exclude the alerts.

> 0.1
||| % $._config,
labels: {
severity: 'critical',

Contributor commented:

Wouldn't we get alerted for increased lag anyway if there are errors? 10% errors might be tolerable 🤷 I'm only trying to reduce the number of alerts so we don't get alert fatigue from the start.

@pstibrany (Member, Author) replied:

Here the condition must be true for 15 minutes. I don't think 10% errors during that time is tolerable -- normally we expect 0 errors.

@pstibrany (Member, Author) added:

> Wouldn't we get alerted for increased lag if there are errors?

If the client is able to read further records, then this wouldn't be true.

Contributor replied:

Isn't this alerting on causes rather than symptoms? Errors themselves are causes of outages, but they aren't symptoms of an outage. Increased delay or failing consistency checks are the symptoms.

We might not have to immediately address the errors if all the customer-visible metrics look OK (like delay and strong-consistency guarantees).

I don't want to delve too deep here, but I suspect this alert will trigger along with some other alert and won't bring much value on its own. If you think we should keep it, then I can also disagree and commit 😄

@pstibrany (Member, Author) replied:

Honestly, I don't know yet. I guess the first few incidents will tell us more.
