Ruler: MimirRulerTooManyFailedQueries alert due to user error #7668

Closed
rekup opened this issue Mar 20, 2024 · 6 comments

@rekup
Contributor

rekup commented Mar 20, 2024

Describe the bug

We use Mimir and the rules from the mimir-mixin. Recently we onboarded a customer who sends Kubernetes metrics to our Mimir cluster. Due to a configuration error on the customer's Kubernetes cluster, the kubelet metrics were scraped multiple times (multiple ServiceMonitors for the kubelet). The kube-prometheus stack ships the following rules:

  spec:
    groups:
    - name: kubelet.rules
      rules:
      - expr: histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.99"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
      - expr: histogram_quantile(0.9, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.9"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
      - expr: histogram_quantile(0.5, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
          metrics_path="/metrics"}[5m])) by (cluster, instance, le) * on (cluster,
          instance) group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
        labels:
          quantile: "0.5"
        record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile

These rules will fail with a many-to-many matching not allowed error if the kubelet is scraped by multiple jobs. This is obviously a user error, and in the Mimir logs we can observe the corresponding error messages (a quick check for the duplicate scrape is sketched after the log excerpt):

ts=2024-03-20T06:27:56.041993291Z caller=group.go:480 level=warn name=KubeletPodStartUpLatencyHigh index=5 component=ruler insight=true user=tenant1 file=/var/lib/mimir/ruler/tenant1/agent%2Fmonitoring%2Fkube-prometheus-stack-kubernetes-system-kubelet%2Ff4851e1e-c337-4c08-8dc2-4c47642212f9 group=kubernetes-system-kubelet msg="Evaluating rule failed" rule="alert: KubeletPodStartUpLatencyHigh\nexpr: histogram_quantile(0.99, sum by (cluster, instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job=\"kubelet\",metrics_path=\"/metrics\"}[5m])))\n  * on (cluster, instance) group_left (node) kubelet_node_name{job=\"kubelet\",metrics_path=\"/metrics\"}\n  > 60\nfor: 15m\nlabels:\n  severity: warning\nannotations:\n  description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds\n    on node {{ $labels.node }}.\n  runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh\n  summary: Kubelet Pod startup latency is too high.\n" err="found duplicate series for the match group {instance=\"192.168.16.181:10250\"} on the right hand-side of the operation: [{__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.16.181:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"master1.k8s-test.tenant.org\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", service=\"kube-prometheus-stack-kubelet\"}, {__name__=\"kubelet_node_name\", endpoint=\"https-metrics\", instance=\"192.168.16.181:10250\", job=\"kubelet\", metrics_path=\"/metrics\", namespace=\"kube-system\", node=\"master1.k8s-test.tenant.org\", prometheus=\"monitoring/kube-prometheus-stack-prometheus\", service=\"kube-prometheus-kube-prome-kubelet\"}];many-to-many matching not allowed: matching labels must be unique on one side"
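
For illustration, a quick check along these lines (not part of the mixin; the selector mirrors the rules above) shows whether the right-hand side of those joins has duplicate series, which is exactly what produces the many-to-many error:

  # Returns one result per kubelet instance that is exposed by more than one
  # scrape job/service, i.e. the situation that breaks the group_left joins.
  count by (cluster, instance) (
    kubelet_node_name{job="kubelet", metrics_path="/metrics"}
  ) > 1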

As soon as Mimir evaluates these rules, the MimirRulerTooManyFailedQueries alert is triggered. However, according to this alert's runbook, such user errors should not trigger it:

Each rule evaluation may fail due to many reasons, eg. due to invalid PromQL expression, or query hits limits on number of chunks. These are “user errors”, and this alert ignores them.

(https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#mimirrulertoomanyfailedqueries)

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir (Mimir, version 2.11.0 (branch: release-2.11, revision: c8939ea))
  2. Create multiple scrape jobs for the kubelet
  3. Create the recording rules specified above
  4. Check the result of the following rule (as per the mimir-mixin); a simpler raw-counter check is sketched right after it
100 * (sum by (cluster, team, instance) (rate(cortex_ruler_queries_failed_total{job="mimir"}[5m])) / sum by (cluster, team, instance) (rate(cortex_ruler_queries_total{job="mimir"}[5m]))) > 1
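
A simpler check of the raw counter (the grouping by instance is only illustrative; the label appears in the mixin expression above) also shows the failures while the bad rules are being evaluated:

  sum by (instance) (rate(cortex_ruler_queries_failed_total[5m])) > 0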

Expected behavior

I would expect that user errors (such as a rule with many-to-many matching) do not increase the cortex_ruler_queries_failed_total counter.

Environment

Additional Context

I saw this Cortex issue, which might be relevant.

@rishabhkumar92

cc: @krajorama

We also got MimirRulerTooManyFailedQueries due to a bad rule uploaded by a user.

@krajorama krajorama self-assigned this Apr 2, 2024
@krajorama
Contributor

krajorama commented Apr 2, 2024

Reproduced with mimir-distributed 5.2.2 (Mimir 2.11).

Update: this repro uses the built-in querier in the ruler, not the remote ruler-querier functionality!

I've started the chart with metamonitoring enabled to get some metrics and created a recording rule for cortex_ingester_active_series{} * on (container) cortex_build_info, which results in this error (a minimal rule group that reproduces it is sketched after the error output):

execution: found duplicate series for the match group {container="ingester"} on the right hand-side of the operation:
 [{__name__="cortex_build_info", __replica__="replica-0", branch="HEAD", cluster="krajo", container="ingester", 
endpoint="http-metrics", goversion="go1.21.4", instance="10.1.23.175:8080", job="dev/ingester", namespace="dev", 
pod="krajo-mimir-ingester-zone-a-0", revision="c8939ea", service="krajo-mimir-ingester-zone-a", version="2.11.0"}, 
{__name__="cortex_build_info", __replica__="replica-0", branch="HEAD", cluster="krajo", container="ingester", 
endpoint="http-metrics", goversion="go1.21.4", instance="10.1.23.145:8080", job="dev/ingester", namespace="dev", 
pod="krajo-mimir-ingester-zone-b-0", revision="c8939ea", service="krajo-mimir-ingester-zone-b", version="2.11.0"}];
many-to-many matching not allowed: matching labels must be unique on one side

I see cortex_ruler_queries_failed_total increase.
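
For reference, a minimal rule group of this shape can be uploaded with mimirtool rules load to hit the same error. This is a sketch: the namespace and group name match the rule_group label reported in the next comment, while the interval and record name are made up.

  namespace: krajons
  groups:
    - name: krajogroup
      interval: 1m
      rules:
        - record: repro:cortex_ingester_active_series:many_to_many
          expr: cortex_ingester_active_series{} * on (container) cortex_build_info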

@krajorama
Contributor

I've upgraded to 5.3.0-weekly.283, which has a build from 26th March (https://github.com/grafana/mimir/tree/r283 , 7728f42). It doesn't have the issue: cortex_ruler_queries_failed_total dropped to 0, while the logs still contain the error message.

At the same time I see cortex_prometheus_rule_evaluation_failures_total reporting the errors, with these labels (an illustrative query over this metric follows the listing):

__replica__="replica-0",
cluster="krajo",
container="ruler",
endpoint="http-metrics",
instance="10.1.23.183:8080",
job="dev/ruler",
namespace="dev",
pod="krajo-mimir-ruler-85dc995ff6-9h5jj",
rule_group="/data/metamonitoring/krajons;krajogroup",
service="krajo-mimir-ruler",
user="metamonitoring"

I'm pretty sure this was actually fixed by me in #7567. However, that PR just missed the cut-off for the 2.12 release by a couple of days.

@krajorama
Contributor

krajorama commented Apr 3, 2024

Could not reproduce with the remote ruler on the latest weekly (r284-6db12671).

At first I thought I had, but the ruler dashboard actually uses cortex_prometheus_rule_evaluation_failures_total, which started increasing, and not cortex_ruler_queries_failed_total, which remained at 0.

@krajorama
Contributor

Tested with v2.12.0-rc.4. Could not reproduce, so I think the remote ruler case is fixed in 2.12, most likely by #7472.

Summary: this should be fixed for the remote ruler case in 2.12, and will be fixed for the normal ruler case in 2.13.

@56quarters
Contributor

2.12 has been released. We should be good to close this, right?
