-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add playbook for CortexRequestErrors and config option to exclude specific routes #338
Merged
Merged
Changes from 1 commit
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -64,5 +64,8 @@ | |
writes: true, | ||
reads: true, | ||
}, | ||
|
||
// The routes to exclude from alerts. | ||
alert_excluded_routes: [], | ||
}, | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -109,7 +109,18 @@ Right now most of the execution time will be spent in PromQL's innerEval. NB tha | |
|
||
### CortexRequestErrors | ||
|
||
_TODO: this playbook has not been written yet._ | ||
This alert fires when the rate of 5xx errors of a specific route is > 1% for some time. | ||
|
||
This alert typically acts as a last resort to detect issues / outages. SLO alerts are expected to trigger earlier: if an **SLO alert** has triggered as well for the same read/write path, then you can ignore this alert and focus on the SLO one. | ||
|
||
How to **investigate**: | ||
- Check for which route the alert fired | ||
- Write path: open the `Cortex / Writes` dashboard | ||
- Read path: open the `Cortex / Reads` dashboard | ||
- Looking at the dashboard you should see in which Cortex service the error originates | ||
- The panels in the dashboard are vertically sorted by the network path (eg. on the write path: cortex-gw -> distributor -> ingester) | ||
- If the failing service is going OOM (`OOMKilled`): scale up or increase the memory | ||
- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there | ||
|
||
### CortexTransferFailed | ||
This alert goes off when an ingester fails to find another node to transfer its data to when it was shutting down. If there is both a pod stuck terminating and one stuck joining, look at the kubernetes events. This may be due to scheduling problems caused by some combination of anti affinity rules/resource utilization. Adding a new node can help in these circumstances. You can see recent events associated with a resource via kubectl describe, ex: `kubectl -n <namespace> describe pod <pod>` | ||
|
@@ -355,10 +366,6 @@ WAL corruptions are only detected at startups, so at this point the WAL/Checkpoi | |
2. Equal or more than the quorum number but less than replication factor: There is a good chance that there is no data loss if it was replicated to desired number of ingesters. But it's good to check once for data loss. | ||
3. Equal or more than the replication factor: Then there is definitely some data loss. | ||
|
||
### CortexRequestErrors | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Was duplicated. |
||
|
||
_TODO: this playbook has not been written yet._ | ||
|
||
### CortexTableSyncFailure | ||
|
||
_This alert applies to Cortex chunks storage only._ | ||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR is 338.