add rule for critical distributor inflight push request alert #408

Merged 6 commits on Oct 22, 2021
Changes from 3 commits
24 changes: 24 additions & 0 deletions cortex-mixin/alerts/alerts.libsonnet
@@ -352,6 +352,30 @@
        },
      ],
    },
    {
      name: 'cortex_distributor_inflight_push_request_alert',
      rules: [
        {
          alert: 'CortexDistributorReachingInflightPushRequestLimits',
          expr: |||
            (
                (cortex_distributor_inflight_push_requests / ignoring(limit) cortex_distributor_instance_limits{limit="max_inflight_push_requests"})
                and ignoring (limit)
                (cortex_distributor_instance_limits{limit="max_inflight_push_requests"} > 0)
            ) > 0.8
          |||,
          'for': '5m',
          labels: {
            severity: 'critical',
          },
          annotations: {
            message: |||
              Distributor {{ $labels.job }}/{{ $labels.instance }} has reached {{ $value | humanizePercentage }} of its inflight push request limit.
            |||,
          },
        },
      ],
    },
    {
      name: 'cortex_wal_alerts',
      rules: [
31 changes: 31 additions & 0 deletions cortex-mixin/docs/playbooks.md
@@ -108,6 +108,37 @@ How to **fix**:
1. Ensure shuffle-sharding is enabled in the Cortex cluster
1. Assuming shuffle-sharding is enabled, scaling up ingesters will lower the number of tenants per ingester. However, the effect of this change will be visible only after `-blocks-storage.tsdb.close-idle-tsdb-timeout` period so you may have to temporarily increase the limit

### CortexDistributorReachingInflightPushRequestLimits

This alert fires when the per-distributor-instance limit on inflight push requests is enabled and the actual number of inflight push requests (`cortex_distributor_inflight_push_requests`) is approaching the configured limit (above 80% of it for 5 minutes). Once the limit is reached, new push requests to the distributor fail with a 5xx error, while existing inflight push requests continue to succeed.
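
As a rough illustration of the firing condition, the sketch below mirrors the alert expression in plain Python: the ratio of inflight requests to the limit is only evaluated when the limit is greater than 0 (that is, enabled) and is then compared against the 0.8 threshold from the rule. The function names and sample values are hypothetical, and the `for: 5m` clause is not modelled.

```python
from typing import Optional

ALERT_THRESHOLD = 0.8  # taken from the alert expression: (...) > 0.8


def inflight_utilization(inflight: float, limit: float) -> Optional[float]:
    """Return inflight / limit, or None when the limit is disabled (<= 0)."""
    if limit <= 0:
        return None
    return inflight / limit


def would_match(inflight: float, limit: float) -> bool:
    """True when the instantaneous alert condition holds (ignores 'for: 5m')."""
    utilization = inflight_utilization(inflight, limit)
    return utilization is not None and utilization > ALERT_THRESHOLD
```

With a hypothetical limit of 2000, `would_match(1700, 2000)` holds (85% utilization), while `would_match(1700, 0)` never does because the limit is disabled.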

In case of **emergency**:
- If the actual number of inflight push requests is very close to or already at the set limit, then you can increase the limit via runtime config to gain some time
- Increasing the limit will increase the distributors' memory utilization. Please monitor it via the `Cortex / Writes Resources` dashboard

How the limit is **configured**:
- The limit can be configured either via the CLI flag (`-distributor.instance-limits.max-inflight-push-requests`) or in the runtime config:
  ```
  distributor_instance_limits:
    max_inflight_push_requests: <int>
  ```
- The mixin configures the limit in the runtime config and can be fine-tuned via:
  ```
  _config+:: {
    distributor_instance_limits+:: {
      max_inflight_push_requests: <int>
    }
  }
  ```
- When configured in the runtime config, changes are applied live without requiring a distributor restart
- The configured limit can be queried via `cortex_distributor_instance_limits{limit="max_inflight_push_requests"}`

How to **fix**:
1. **Temporarily increase the limit**<br />
   Do this if the actual number of inflight push requests is very close to, or has already hit, the limit.
2. **Scale up distributors**<br />
   Scaling up distributors will lower the number of inflight push requests per distributor.
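
To estimate how far to scale, a back-of-the-envelope calculation can help. The sketch below assumes push requests are spread evenly across distributors, so the cluster-wide inflight total stays roughly constant as replicas are added; that is an assumption made for illustration, not a guarantee. The function name, the target utilization and the sample numbers are all hypothetical.

```python
import math


def distributors_needed(total_inflight: float, per_instance_limit: int,
                        target_utilization: float = 0.5) -> int:
    """Replicas needed so each distributor sits at or below target_utilization
    of its limit, assuming inflight requests are spread evenly."""
    if per_instance_limit <= 0:
        # The alert only considers limits > 0, i.e. an enabled limit.
        raise ValueError("per_instance_limit must be > 0")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    per_instance_budget = per_instance_limit * target_utilization
    return max(1, math.ceil(total_inflight / per_instance_budget))
```

For example, 10 distributors each at 1,700 of a 2,000 limit (17,000 inflight in total) would need about 17 replicas to sit near 50% utilization: `distributors_needed(17000, 2000, 0.5)`.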

### CortexRequestLatency

This alert fires when a specific Cortex route is experiencing high latency.