[Thanos Ruler] Alerts closing and opening instantly. #4399

Closed
Whyeasy opened this issue Jul 2, 2021 · 2 comments

Whyeasy commented Jul 2, 2021

Thanos, Prometheus and Golang version used:

Thanos version: 0.21.1
Prometheus version: 2.27.1
Alertmanager: 0.22.2

Object Storage Provider:

GCS

What happened:

We run our Thanos Ruler in HA with a number of recording rules and have alerts defined for some of them. We are experiencing issues with alerting: alerts are being closed and instantly created again.

For debugging we created a watchdog alert:

name: ruler-watchdog
rules:
  - alert: ruler-watchdog
    expr: sum(up{namespace="thanos",app_kubernetes_io_component="query"}) by (app_kubernetes_io_component) > 0
    for: 1m
    annotations:
      summary: Watchdog for Thanos Ruler
      message: Watchdog for Thanos Ruler
    labels:
      severity: warning

This one is always firing and we see the same thing happening.

We don't really see a pattern in when these alerts open and close. We first thought it was due to restarts or something similar. The Alertmanager we use for Thanos Ruler is also used by Prometheus on the same cluster, and Prometheus doesn't show this behavior.

What you expected to happen:

I would expect that once an alert is created, it stays open until the alert is resolved.

How to reproduce it (as minimally and precisely as possible):

  • Ruler in HA with 3 instances
  • Alertmanager in HA with 3 instances
  • Create a watchdog alert
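
Concretely, each of the 3 Ruler replicas runs with flags roughly like these (a sketch of the relevant flags only; service names, ports and paths are placeholders, not our exact manifest):

args:
  - rule
  - --data-dir=/var/thanos/rule
  - --rule-file=/etc/thanos/rules/*.yaml
  - --eval-interval=30s
  # each replica gets its own replica label so its series can be told apart
  - --label=thanos_ruler_replica="$(POD_NAME)"
  # the replica label is dropped again before alerts are sent, so Alertmanager can deduplicate
  - --alert.label-drop=thanos_ruler_replica
  - --query=thanos-query.thanos.svc.cluster.local:9090
  # all replicas push alerts to the same 3-instance Alertmanager cluster
  - --alertmanagers.url=http://alertmanager-operated.thanos.svc.cluster.local:9093
  - --objstore.config-file=/etc/thanos/objstore.yaml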

Anything else we need to know:

The Ruler is created via the Prometheus Operator, but deployed in the same namespace as Thanos on the same cluster. The recording rules are working properly and we don't see any gaps in the resulting metrics.
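
Roughly, the ThanosRuler resource looks like this (a trimmed sketch with placeholder names and selectors, not our exact manifest):

apiVersion: monitoring.coreos.com/v1
kind: ThanosRuler
metadata:
  name: thanos-ruler
  namespace: thanos
spec:
  replicas: 3
  image: quay.io/thanos/thanos:v0.21.1
  # the Ruler evaluates PrometheusRule objects matching this selector
  ruleSelector:
    matchLabels:
      role: thanos-rules
  # queries are sent to the Thanos Querier in the same namespace
  queryEndpoints:
    - thanos-query.thanos.svc.cluster.local:9090
  # all replicas push alerts to the same 3-node Alertmanager cluster
  alertmanagersUrl:
    - http://alertmanager-operated.thanos.svc.cluster.local:9093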


stale bot commented Sep 3, 2021

Hello 👋 It looks like there has been no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Sep 3, 2021

stale bot commented Sep 19, 2021

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale bot closed this as completed Sep 19, 2021