Commit

Mixin/Playbooks: Add Alertmanager suggestions for `MimirRequestErrors` and `MimirRequestLatency` (#1702)

* Mixin/Playbooks: Add Alertmanager suggestions for `MimirRequestErrors` and `MimirRequestLatency`

* update PR number

* Update operations/mimir-mixin/docs/playbooks.md

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Update operations/mimir-mixin/docs/playbooks.md

Co-authored-by: Marco Pracucci <marco@pracucci.com>

* Update operations/mimir-mixin/docs/playbooks.md

Co-authored-by: Marco Pracucci <marco@pracucci.com>

Co-authored-by: Marco Pracucci <marco@pracucci.com>
gotjosh and pracucci committed Apr 14, 2022
1 parent 562ae26 commit d501e8b
Showing 2 changed files with 26 additions and 0 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -51,6 +51,7 @@
- `MimirContinuousTestFailed`
* [ENHANCEMENT] Added `per_cluster_label` support to allow changing the label name used to differentiate between Kubernetes clusters. #1651
* [ENHANCEMENT] Dashboards: Show QPS and latency of the Alertmanager Distributor. #1696
* [ENHANCEMENT] Playbooks: Add Alertmanager suggestions for `MimirRequestErrors` and `MimirRequestLatency`. #1702
* [BUGFIX] Dashboards: Fix "Failed evaluation rate" panel on Tenants dashboard. #1629

### Jsonnet
25 changes: 25 additions & 0 deletions operations/mimir-mixin/docs/playbooks.md
@@ -214,6 +214,23 @@ How to **investigate**:
- If the memcached eviction rate is high, you should scale up the memcached replicas. Check the recommendations on the `Mimir / Scaling` dashboard and make adjustments as necessary.
- If the memcached eviction rate is zero or very low, the latency may be caused by "first time" queries (see the query sketch below).
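
A minimal PromQL sketch for checking the eviction rate, assuming the standard `memcached_exporter` metric `memcached_items_evicted_total` and Kubernetes-style labels (adjust the selector to your environment):

```promql
# Per-pod eviction rate over the last 5 minutes (metric name from memcached_exporter)
sum by (pod) (rate(memcached_items_evicted_total{namespace="<namespace>"}[5m]))
```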

#### Alertmanager

How to **investigate**:

- Check the `Mimir / Alertmanager` dashboard
  - Looking at the dashboard, you should see which part of the stack is affected
- Deduce where in the stack the latency is being introduced
- **Configuration API (gateway) + Alertmanager UI**
  - Latency may be caused by the time it takes the gateway to receive the entire request from the client. There are many possible reasons for this, so you may need to communicate with the user. For example:
- Network issues such as packet loss between the client and gateway.
- Poor performance of intermediate network hops such as load balancers or HTTP proxies.
- Client process having insufficient CPU resources.
- The gateway may need to be scaled up. Use the `Mimir / Scaling` dashboard to check for CPU usage vs requests.
  - There could be a problem with authentication (e.g. a slow authentication layer)
- **Alertmanager distributor**
  - Typically, the Alertmanager distributor's p99 latency is in the 50-100ms range. If the distributor latency is higher than this, you may need to scale up the number of Alertmanager replicas (see the query sketch below).
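
As a reference point, a PromQL sketch for the distributor's p99 latency, assuming Mimir's `cortex_request_duration_seconds` histogram; the `route` matcher below is an assumption, so cross-check the actual route names against the dashboard:

```promql
# p99 latency of Alertmanager distributor requests (route regex is an assumption)
histogram_quantile(0.99,
  sum by (le) (rate(cortex_request_duration_seconds_bucket{route=~".*alertmanager.*"}[5m]))
)
```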

### MimirRequestErrors

This alert fires when the rate of 5xx errors of a specific route is > 1% for some time.
@@ -231,6 +248,14 @@ How to **investigate**:
- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there (a query sketch for spotting crashing containers follows this list)
- If the crashing service is the query-frontend, querier, or store-gateway, and you have the "activity tracker" feature enabled, look for the `found unfinished activities from previous run` message and the subsequent `activity` messages in the log file to see which queries caused the crash.
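
A hedged way to spot crashing or OOM-killed containers, assuming kube-state-metrics is scraped (the metric names below come from kube-state-metrics):

```promql
# Containers that restarted recently, i.e. likely crashing or panicking
increase(kube_pod_container_status_restarts_total{namespace="<namespace>"}[1h]) > 0

# Containers whose last termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled", namespace="<namespace>"} == 1
```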

#### Alertmanager

How to **investigate**:

- Looking at the `Mimir / Alertmanager` dashboard, you should see in which part of the stack the errors originate (a query sketch for breaking errors down by route follows this list)
- If some replicas are going OOM (`OOMKilled`): scale up the replicas or increase their memory
- If the failing service is crashing / panicking: look for the stack trace in the logs and investigate from there
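
For instance, a PromQL sketch to break the 5xx errors down by route, assuming Mimir's `cortex_request_duration_seconds` metric with `status_code` and `route` labels (verify the label and job names in your setup):

```promql
# 5xx error rate by route for Alertmanager-related requests (job regex is an assumption)
sum by (route) (
  rate(cortex_request_duration_seconds_count{job=~".*alertmanager.*", status_code=~"5.."}[5m])
)
```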

### MimirIngesterUnhealthy

This alert fires when an ingester is marked as unhealthy. Check the ring web page to see which ingester is marked as unhealthy, then check the logs for anything related to that ingester, e.g. `kubectl logs -f ingester-01 --namespace=prod`. A simple way to resolve this may be to click the "Forget" button on the ring page, especially if the pod no longer exists (for example, because it was on a node that got shut down). In that case, check whether there are any events related to the node that the pod is or was on, e.g. `kubectl get events --namespace=prod | grep cloud-provider-node`.
Expand Down
