Ruler: MimirRulerTooManyFailedQueries alert due to user error #7668

Closed
rekup opened this issue Mar 20, 2024 · 6 comments

@rekup
Contributor

rekup commented Mar 20, 2024

Describe the bug

We use Mimir and the rules from the mimir-mixin. Recently we onboarded a customer who sends Kubernetes metrics to our Mimir cluster. Due to a configuration error on the customer's Kubernetes cluster, the kubelet metrics were scraped multiple times (multiple ServiceMonitors for the kubelet). The kube-prometheus stack ships the following rules:

  spec:
    groups:
    - name: kubelet.rules
      rules:
      - expr: histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.99"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
      - expr: histogram_quantile(0.9, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.9"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
      - expr: histogram_quantile(0.5, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.5"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile

These rules will fail with a many-to-many matching not allowed error if the kubelet is scraped by multiple jobs. This is obviously a user error, and in the Mimir logs we can observe the corresponding error messages (a quick check for the duplicate scrape is sketched after the log excerpt):

ts=2024-03-20T06:27:56.041993291Z caller=group.go:480 level=warn name=KubeletPodStartUpLatencyHigh index=5 component=ruler insight=true user=tenant1 file=/var/lib/mimir/ruler/tenant1/agent%2Fmonitoring%2Fkube-prometheus-stack-kubernetes-system-kubelet%2Ff4851e1e-c337-4c08-8dc2-4c47642212f9 group=kubernetes-system-kubelet msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n  > 60\nfor: 15m\nlabels:\n  severity: warning\nannotations:\n  description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds\n    on node {{ $labels.node }}.\n  runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh\n  summary: Kubelet Pod startup latency is too high.\n" err="found duplicate series for the match group {instance=\"192.168.16.181:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.16.181:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"master1.k8s-test.tenant.org\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", service=\"kube-prometheus-stack-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.16.181:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"master1.k8s-test.tenant.org\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
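
For illustration, a quick check along these lines (not part of the mixin; the selector mirrors the rules above) shows whether the right-hand side of those joins has duplicate series, which is exactly what produces the many-to-many error:

  # Returns one result per kubelet instance that is exposed by more than one
  # scrape job/service, i.e. the situation that breaks the group_left joins.
  count by (cluster, instance) (
    kubelet_node_name{job="kubelet", metrics_path="/metrics"}
  ) > 1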

As soon as Mimir evaluates these rules, the MimirRulerTooManyFailedQueries alert is triggered. However, according to this alert's runbook, such user errors should not trigger it:

Each rule evaluation may fail due to many reasons, eg. due to invalid PromQL expression, or query hits limits on number of chunks. These are “user errors”, and this alert ignores them.

(https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#mimirrulertoomanyfailedqueries)

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir (Mimir, version 2.11.0 (branch: release-2.11, revision: c8939ea))
  2. Create multiple scrape jobs for the kubelet
  3. Create the recording rules specified above
  4. Check the result of the following rule (as per the mimir-mixin); a simpler raw-counter check is sketched right after it
100 * (sum by (cluster, team, instance) (rate(cortex_ruler_queries_failed_total{job="mimir"}[5m])) / sum by (cluster, team, instance) (rate(cortex_ruler_queries_total{job="mimir"}[5m]))) > 1
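
A simpler check of the raw counter (the grouping by instance is only illustrative; the label appears in the mixin expression above) also shows the failures while the bad rules are being evaluated:

  sum by (instance) (rate(cortex_ruler_queries_failed_total[5m])) > 0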

Expected behavior

I would expect that user errors (such as a rule with many-to-many matching) do not increase the cortex_ruler_queries_failed_total counter.

Environment

Additional Context

I saw this Cortex issue, which might be relevant.

@rishabhkumar92

cc: @krajorama

We also got MimirRulerTooManyFailedQueries due to a bad rule uploaded by a user.

@krajorama krajorama self-assigned this Apr 2, 2024
@krajorama
Contributor

krajorama commented Apr 2, 2024

Reproduced with mimir-distributed 5.2.2 (Mimir 2.11).

Update: this repro uses the built-in querier in the ruler, not the remote ruler-querier functionality!

I've started the chart with metamonitoring enabled to get some metrics and created a recording rule for cortex_ingester_active_series{} * on (container) cortex_build_info, which results in this error (a minimal rule group that reproduces it is sketched after the error output):

execution: found duplicate series for the match group {container="ingester"} on the right hand-side of the operation:
 [{__name__="cortex_build_info", __replica__="replica-0", branch="HEAD", cluster="krajo", container="ingester", 
endpoint="http-metrics", goversion="go1.21.4", instance="10.1.23.175:8080", job="dev/ingester", namespace="dev", 
pod="krajo-mimir-ingester-zone-a-0", revision="c8939ea", service="krajo-mimir-ingester-zone-a", version="2.11.0"}, 
{__name__="cortex_build_info", __replica__="replica-0", branch="HEAD", cluster="krajo", container="ingester", 
endpoint="http-metrics", goversion="go1.21.4", instance="10.1.23.145:8080", job="dev/ingester", namespace="dev", 
pod="krajo-mimir-ingester-zone-b-0", revision="c8939ea", service="krajo-mimir-ingester-zone-b", version="2.11.0"}];
many-to-many matching not allowed: matching labels must be unique on one side

I see cortex_ruler_queries_failed_total increase.
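
For reference, a minimal rule group of this shape can be uploaded with mimirtool rules load to hit the same error. This is a sketch: the namespace and group name match the rule_group label reported in the next comment, while the interval and record name are made up.

  namespace: krajons
  groups:
    - name: krajogroup
      interval: 1m
      rules:
        - record: repro:cortex_ingester_active_series:many_to_many
          expr: cortex_ingester_active_series{} * on (container) cortex_build_info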

@krajorama
Contributor

I've upgraded to 5.3.0-weekly.283, which has a build from 26th March (https://github.com/grafana/mimir/tree/r283 , 7728f42). It doesn't have the issue: cortex_ruler_queries_failed_total dropped to 0, while the logs still contain the error message.

At the same time I see cortex_prometheus_rule_evaluation_failures_total reporting the errors, with these labels (an illustrative query over this metric follows the listing):

__replica__="replica-0",
cluster="krajo",
container="ruler",
endpoint="http-metrics",
instance="10.1.23.183:8080",
job="dev/ruler",
namespace="dev",
pod="krajo-mimir-ruler-85dc995ff6-9h5jj",
rule_group="/data/metamonitoring/krajons;krajogroup",
service="krajo-mimir-ruler",
user="metamonitoring"

I'm pretty sure this was actually fixed by me in #7567. However, that PR just missed the cut-off for the 2.12 release by a couple of days.

@krajorama
Contributor

krajorama commented Apr 3, 2024

Could not reproduce with the remote ruler on the latest weekly (r284-6db12671).

At first I thought I had, but the ruler dashboard actually uses cortex_prometheus_rule_evaluation_failures_total, which started increasing, and not cortex_ruler_queries_failed_total, which remained at 0.

@krajorama
Contributor

Tested with v2.12.0-rc.4. Could not reproduce, so I think the remote ruler case is fixed in 2.12, most likely by #7472.

Summary: this should be fixed for the remote ruler case in 2.12, and will be fixed for the normal ruler case in 2.13.

@56quarters
Contributor

2.12 has been released. We should be good to close this, right?
