Ingest-storage alerts #7702
Conversation
…RateTooHigh alerts. Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
Force-pushed from bf68c41 to 0d3eab3.
Very nice job! I left a few minor comments on the runbooks.
alert: $.alertName('StartingIngesterKafkaReceiveDelayIncreasing'),
'for': '5m',
expr: |||
  deriv(histogram_quantile(0.99, sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))[5m:1m]) > 0
Thinking out loud: I'm wondering whether it's better to use the average or the 99th percentile here. Maybe there's no difference. Since we expect to consume records in order, the average (measured over the last 1m) should be good as well. The main differences that come to my mind:
- The 99th percentile takes longer to decrease while catching up, but also takes longer to increase when lag is growing.
- The average doesn't suffer from the imprecise measurement given by the 99th percentile when a classic histogram is used.
I have no strong opinion, so I leave my thoughts to you. The same applies to MimirRunningIngesterReceiveDelayTooHigh.
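For comparison, an average-based variant that still queries the native histogram could look roughly like the sketch below. It only reuses the placeholders from the expression above and is not the final alert: histogram_sum and histogram_count extract the sum and count of the native histogram, so their ratio is the mean receive delay.

// Sketch: alert when the mean (rather than the 99th percentile) receive delay keeps growing.
expr: |||
  deriv((
    histogram_sum(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))
    /
    histogram_count(sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds{phase="starting"}[1m])))
  )[5m:1m]) > 0
||| % $._config,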
The average doesn't suffer from the imprecise measurement given by the 99th percentile when a classic histogram is used.
Note that we explicitly use a native histogram in the query here.
Switched to using the average (but still querying the native histogram series).
but still querying the native histogram series
Had to switch to computing the average from classic histograms instead, due to monitoring-mixins/mixtool#163.
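For readers following along, computing the average from classic histograms presumably means dividing the rate of the _sum series by the rate of the _count series, along these lines (a sketch only, not necessarily the exact expression that was merged):

// Sketch: mean receive delay derived from the classic histogram's _sum and _count series.
expr: |||
  sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds_sum{phase="starting"}[1m]))
  /
  sum by (%(alert_aggregation_labels)s, %(per_instance_label)s) (rate(cortex_ingest_storage_reader_receive_delay_seconds_count{phase="starting"}[1m]))
||| % $._config,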
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…asing and RunningIngesterReceiveDelayTooHigh alerts. Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
…layIncreasing and RunningIngesterReceiveDelayTooHigh alerts, because mixtool can't handle native histogram functions. Signed-off-by: Peter Štibraný <pstibrany@gmail.com>
// We use node_id to only alert if problems with the same Kafka node are repeating.
// If problems are for different nodes (e.g. during a rollout), that is not a problem, and we don't need to trigger the alert.
expr: |||
  sum by(%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (rate(cortex_ingest_storage_reader_read_errors_total[1m]))
The [1m] won't work if the metrics are scraped with 1m intervals. Juraj added some utils to make this configurable. See this example from mimir/operations/mimir-mixin/alerts/alertmanager.libsonnet, lines 83 to 85 in be893e7:
expr: |||
  increase(cortex_alertmanager_state_initial_sync_completed_total{outcome="failed"}[%s]) > 0
||| % $.alertRangeInterval(1),
Can you change the intervals of the alerts to use $.alertRangeInterval(1) for 1m, $.alertRangeInterval(5) for 5m, etc.?
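Applied to the read-errors expression above, that might look roughly like the sketch below. The range_interval field name is a hypothetical placeholder, used only to show one way of combining the named config placeholders with the helper; the mixin may wire this up differently.

// Sketch: same query, with the hard-coded [1m] replaced by the configurable range interval.
expr: |||
  sum by(%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (rate(cortex_ingest_storage_reader_read_errors_total[%(range_interval)s]))
||| % ($._config + { range_interval: $.alertRangeInterval(1) }),  // range_interval is an illustrative name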
I really don't want to use a longer interval here, as I expect this is typically a problem that disappears quickly. Also, I've opted for a rather short for: 5m here, and if we used a longer range window, that wouldn't work.
Let's use this first and learn whether it works. I expect we will adjust these alerts a few times before considering them "ready for everyone".
sure, as long as we don't forget 👍
So I finally updated our mixin today, wanting to enable the validation in our repo that no queries have too short a range interval, and ran into this :(. I currently point the mixin at the top of main. If these alerts are not yet recommended for general consumption, what would be your recommendation for the mixins: to point at tags instead of the top of main? Or could we add an option to exclude the alerts which are not yet ready?
I'm happy to add an option to exclude these alerts. They are for a new Kafka-based ingestion mode we're working on, and it's far from ready. We added alerts because we plan to start testing it internally.
Would you send a PR with the option?
Will do, thank you.
Finally got to it, sorry for the delay. I raised #7867 to add the option to exclude the alerts.
  > 0.1
||| % $._config,
labels: {
  severity: 'critical',
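For context, pieced together from the fragments in this diff and the discussion that follows, the alert under review appears to have roughly the shape sketched below. This is a reconstruction for readability, not verbatim source: the alert name and the denominator metric are hypothetical placeholders, and the for: '15m' duration is taken from the reply below.

// Reconstruction (illustrative only): fire when more than 10% of reads from the same Kafka node fail for 15 minutes.
{
  alert: $.alertName('IngesterKafkaReadErrorsRateTooHigh'),  // hypothetical name
  'for': '15m',
  // Note: the denominator metric name below is a placeholder; only the numerator appears in the diff.
  expr: |||
    sum by(%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (rate(cortex_ingest_storage_reader_read_errors_total[1m]))
    /
    sum by(%(alert_aggregation_labels)s, %(per_instance_label)s, node_id) (rate(cortex_ingest_storage_reader_read_requests_total[1m]))
    > 0.1
  ||| % $._config,
  labels: {
    severity: 'critical',
  },
},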
Wouldn't we get alerted for increased lag if there are errors? 10% errors might be tolerable 🤷 I'm only trying to reduce the number of alerts so we don't get alert fatigue from the start.
Here the condition must be true for 15 minutes. I don't think 10% errors during that time is tolerable -- normally we expect 0 errors.
Wouldn't we get alerted for increased lag if there are errors?
If the client is able to read further records, then this wouldn't be true.
Isn't this alerting on causes rather than symptoms? Errors themselves are causes of outages, but they aren't symptoms of an outage. Increased delay or failing consistency checks are the symptoms.
We might not have to immediately address the errors if all the customer-visible metrics look ok (like delay and strong-consistency guarantees).
I don't want to delve too deep here, but I suspect this alert will trigger along with some other alert and won't bring much value on its own. If you think we should keep it, then I can also disagree and commit 😄
Honestly, I don't know yet. I guess the first few incidents will tell us more.
What this PR does
This PR adds new alerts useful when using ingest-storage:
This PR also makes ingesters ignore context cancellation as an error when reading records from Kafka. This "error" is expected when ingesters are shutting down.
Checklist
- CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
- about-versioning.md updated with experimental features.