Mixin/Playbooks: Add Alertmanager suggestions for MimirRequestErrors and MimirRequestLatency #1702

Merged 5 commits on Apr 14, 2022

Changes from 2 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -51,6 +51,7 @@
- `MimirContinuousTestFailed`
* [ENHANCEMENT] Added `per_cluster_label` support to allow to change the label name used to differentiate between Kubernetes clusters. #1651
* [ENHANCEMENT] Dashboards: Show QPS and latency of the Alertmanager Distributor. #1696
* [ENHANCEMENT] Playbooks: Add Alertmanager suggestions for `MimirRequestErrors` and `MimirRequestLatency`. #1702
* [BUGFIX] Dashboards: Fix "Failed evaluation rate" panel on Tenants dashboard. #1629

### Jsonnet
25 changes: 25 additions & 0 deletions operations/mimir-mixin/docs/playbooks.md
@@ -214,6 +214,23 @@ How to **investigate**:
- If the memcached eviction rate is high, you should scale up the memcached replicas. Check the recommendations on the `Mimir / Scaling` dashboard and make reasonable adjustments as necessary (a query sketch follows this list).
- If the memcached eviction rate is zero or very low, it may be caused by "first time" queries.
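
For a quick check outside the dashboard, a PromQL sketch of the eviction rate follows. The metric name assumes the standard `memcached_exporter`; adjust it to whatever your memcached deployment exposes.

```
# Per-instance memcached eviction rate over the last 5 minutes
# (assumes memcached_exporter metric names).
sum by (instance) (rate(memcached_items_evicted_total[5m]))
```

A sustained non-zero rate points at the "scale up memcached" path above; a flat zero points at "first time" queries.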

#### Alertmanager

How to **investigate**:
Contributor review comment: This needs to be a transitive verb ("How to investigate [what]:")

- Check the `Mimir / Alertmanager` dashboard
- Looking at the dashboard you should see which part of the stack is affected
- Deduce where in the stack the latency is being introduced
- **`Configuration API (gateway) + Alertmanager UI`**
- Latency may be caused by the time taken for the gateway to receive the entire request from the client. There are a multitude of reasons this can occur, so communication with the user may be necessary. For example:
- Network issues such as packet loss between the client and gateway.
- Poor performance of intermediate network hops such as load balancers or HTTP proxies.
- Client process having insufficient CPU resources.
- The gateway may need to be scaled up. Use the `Mimir / Scaling` dashboard to check for CPU usage vs requests.
- There could be a problem with authentication (e.g. a slow authentication layer).
- **`Distributor`**
- Typically, distributor p99 latency is in the range of 50-100ms. If the distributor latency is higher than this, you may need to scale up the number of Alertmanager replicas (see the query sketch after this list).
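
As a complement to the dashboard panels, a hedged PromQL sketch for locating where Alertmanager latency is introduced. `cortex_request_duration_seconds` is the request histogram exposed by Mimir components; the `job` matcher is an assumption and should be adjusted to however Alertmanager is labelled in your environment.

```
# p99 latency per route for Alertmanager requests over the last 5 minutes.
# The job matcher is an assumption; adjust it to your labelling scheme.
histogram_quantile(0.99,
  sum by (le, route) (
    rate(cortex_request_duration_seconds_bucket{job=~".*alertmanager.*"}[5m])
  )
)
```

Breaking the result down by `route` helps separate configuration API / UI traffic from distributor traffic, matching the split in the list above.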

### MimirRequestErrors

This alert fires when the rate of 5xx errors of a specific route is > 1% for some time.
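
The exact expression is defined in the mixin; purely as an illustrative sketch of the condition described above (percentage of 5xx responses per route), assuming the standard `cortex_request_duration_seconds_count` counter and a `status_code` label:

```
# Illustrative only: share of 5xx responses per route over the last minute.
100 *
  sum by (route) (rate(cortex_request_duration_seconds_count{status_code=~"5.."}[1m]))
/
  sum by (route) (rate(cortex_request_duration_seconds_count[1m]))
> 1
```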
@@ -231,6 +248,14 @@ How to **investigate**:
- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there
- If crashing service is query-frontend, querier or store-gateway, and you have "activity tracker" feature enabled, look for `found unfinished activities from previous run` message and subsequent `activity` messages in the log file to see which queries caused the crash.

#### Alertmanager

How to **investigate**:

- Looking at the `Mimir / Alertmanager` dashboard, you should see in which part of the stack the error originates
- If some replicas are going OOM (`OOMKilled`): scale up or increase the memory (see the query sketch after this list)
- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there
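
To spot OOM kills and crash loops without trawling every pod, a PromQL sketch using kube-state-metrics. The `container` label value is an assumption; match it to however the Alertmanager container is named in your deployment.

```
# Alertmanager containers whose last termination was an OOM kill.
kube_pod_container_status_last_terminated_reason{container="alertmanager", reason="OOMKilled"} == 1

# Containers that restarted in the last hour (possible crash/panic loop).
increase(kube_pod_container_status_restarts_total{container="alertmanager"}[1h]) > 0
```

If either query returns results, continue with the log and stack-trace investigation described above.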

### MimirIngesterUnhealthy

This alert goes off when an ingester is marked as unhealthy. Check the ring web page to see which is marked as unhealthy. You could then check the logs to see if there are any related to that ingester ex: `kubectl logs -f ingester-01 --namespace=prod`. A simple way to resolve this may be to click the "Forgot" button on the ring page, especially if the pod doesn't exist anymore. It might not exist anymore because it was on a node that got shut down, so you could check to see if there are any logs related to the node that pod is/was on, ex: `kubectl get events --namespace=prod | grep cloud-provider-node`.