mixin: Fix alert about unhealthy sidecar #2929

Merged: 1 commit, Aug 12, 2020
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -27,6 +27,7 @@ We use *breaking* word for marking changes that are not backward compatible (rel
- [#2970](https://github.com/thanos-io/thanos/pull/2970) Store: Upgrade minio-go/v7 to fix slowness when running on EKS.
- [#2957](https://github.com/thanos-io/thanos/pull/2957) Rule: now sets all of the relevant fields properly; avoids a panic when `/api/v1/rules` is called and the time zone is _not_ UTC; `rules` field is an empty array now if no rules have been defined in a rule group.
- [#2976](https://github.com/thanos-io/thanos/pull/2976) Query: Better rounding for incoming query timestamps.
- [#2929](https://github.com/thanos-io/thanos/pull/2929) Mixin: Fix expression for 'unhealthy sidecar' alert and also increase the timeout for 10 minutes.

### Added

2 changes: 1 addition & 1 deletion examples/alerts/alerts.md
@@ -275,7 +275,7 @@ rules:
message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value
}} seconds.
expr: |
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 600
labels:
severity: critical
```
2 changes: 1 addition & 1 deletion examples/alerts/alerts.yaml
@@ -258,7 +258,7 @@ groups:
message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{
$value }} seconds.
expr: |
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 600
labels:
severity: critical
- name: thanos-store.rules
86 changes: 55 additions & 31 deletions examples/alerts/tests.yaml
@@ -22,47 +22,47 @@ tests:
exp_samples:
- labels: '{}'
value: 120
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
eval_time: 2m
exp_samples:
- labels: '{pod="thanos-sidecar-pod-0"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
value: 43
- labels: '{pod="thanos-sidecar-pod-1"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
value: 42
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
eval_time: 5m
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
eval_time: 10m
exp_samples:
- labels: '{pod="thanos-sidecar-pod-0"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
value: 0
- labels: '{pod="thanos-sidecar-pod-1"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
value: 0
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
eval_time: 6m
- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
eval_time: 11m
exp_samples:
- labels: '{pod="thanos-sidecar-pod-0"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
value: 0
- labels: '{pod="thanos-sidecar-pod-1"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
value: 0
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
eval_time: 5m
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
eval_time: 10m
exp_samples:
- labels: '{pod="thanos-sidecar-pod-0"}'
value: 300
- labels: '{pod="thanos-sidecar-pod-1"}'
value: 300
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
eval_time: 6m
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
value: 600
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
value: 600
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
eval_time: 11m
exp_samples:
- labels: '{pod="thanos-sidecar-pod-0"}'
value: 360
- labels: '{pod="thanos-sidecar-pod-1"}'
value: 360
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod) >= 300
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
value: 660
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
value: 660
- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod) >= 600
eval_time: 12m
exp_samples:
- labels: '{pod="thanos-sidecar-pod-0"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
value: 720
- labels: '{pod="thanos-sidecar-pod-1"}'
- labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
value: 720
alert_rule_test:
- eval_time: 1m
@@ -71,24 +71,48 @@ tests:
alertname: ThanosSidecarUnhealthy
- eval_time: 3m
alertname: ThanosSidecarUnhealthy
- eval_time: 5m
- eval_time: 10m
alertname: ThanosSidecarUnhealthy
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
pod: thanos-sidecar-pod-0
exp_annotations:
message: 'Thanos Sidecar is unhealthy for 2 seconds.'
- eval_time: 6m
message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 600 seconds.'
- exp_labels:
severity: critical
job: thanos-sidecar
pod: thanos-sidecar-pod-1
exp_annotations:
message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 600 seconds.'
- eval_time: 11m
alertname: ThanosSidecarUnhealthy
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
pod: thanos-sidecar-pod-0
exp_annotations:
message: 'Thanos Sidecar is unhealthy for 2 seconds.'
message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 660 seconds.'
- exp_labels:
severity: critical
job: thanos-sidecar
pod: thanos-sidecar-pod-1
exp_annotations:
message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 660 seconds.'
- eval_time: 12m
alertname: ThanosSidecarUnhealthy
exp_alerts:
- exp_labels:
severity: critical
job: thanos-sidecar
pod: thanos-sidecar-pod-0
exp_annotations:
message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 720 seconds.'
- exp_labels:
severity: critical
job: thanos-sidecar
pod: thanos-sidecar-pod-1
exp_annotations:
message: 'Thanos Sidecar is unhealthy for 2 seconds.'
message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 720 seconds.'
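
The expected values in the updated tests follow from simple arithmetic: promtool unit tests measure `eval_time` from time 0, so `time()` equals the elapsed seconds, and the heartbeat metric reads 0 at the later evaluation points shown above. The Python sketch below reproduces that arithmetic. The helper names (`eval_time_to_seconds`, `heartbeat_age`, `render_message`) are hypothetical and exist only for illustration; they are not part of Thanos or promtool.

```python
THRESHOLD_SECONDS = 600  # taken from the new rule: ... >= 600


def eval_time_to_seconds(eval_time: str) -> int:
    """Convert an 'Nm' eval_time such as '10m' to seconds.

    Assumption: only whole-minute values are handled, which is all the
    tests in this diff use.
    """
    if not eval_time.endswith("m"):
        raise ValueError(f"unsupported eval_time: {eval_time!r}")
    return int(eval_time[:-1]) * 60


def heartbeat_age(eval_time: str, last_heartbeat: int = 0) -> int:
    """Mimic time() - <last heartbeat timestamp> at the given eval_time."""
    return eval_time_to_seconds(eval_time) - last_heartbeat


def render_message(job: str, pod: str, value: int) -> str:
    """Mimic the alert annotation template with $labels and $value filled in."""
    return f"Thanos Sidecar {job} {pod} is unhealthy for {value} seconds."


if __name__ == "__main__":
    for t in ("10m", "11m", "12m"):
        age = heartbeat_age(t)
        fires = age >= THRESHOLD_SECONDS
        print(t, age, fires, render_message("thanos-sidecar", "thanos-sidecar-pod-0", age))
```

With a heartbeat value of 0, the ages at 10m, 11m and 12m come out as 600, 660 and 720 seconds, matching the `exp_samples` values and the annotation messages expected by the new tests.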
2 changes: 1 addition & 1 deletion mixin/alerts/sidecar.libsonnet
@@ -27,7 +27,7 @@
message: 'Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.',
},
expr: |||
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) by (job, pod) >= 600
||| % thanos.sidecar,
labels: {
severity: 'critical',
Expand Down
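
To illustrate why the old expression produced messages such as "Thanos Sidecar is unhealthy for 2 seconds.", the following Python sketch models the two expressions over a small set of labelled samples. It is a simplified model of PromQL semantics, not Prometheus code; the function names (`stale_series`, `old_expr`, `new_expr`) and the dictionary representation of an instant vector are assumptions made for this example.

```python
from typing import Dict, Tuple

# An instant vector modelled as {label set: value}; a label set is a tuple
# of (name, value) pairs so it can be used as a dictionary key.
Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]


def stale_series(now: float, last_heartbeat: Vector, threshold: float) -> Vector:
    """Model of: time() - max(metric) by (job, pod) >= threshold.

    Keeps one sample per (job, pod) whose heartbeat age meets the threshold;
    labels are preserved and the value is the age in seconds.
    """
    return {
        labels: now - ts
        for labels, ts in last_heartbeat.items()
        if now - ts >= threshold
    }


def old_expr(now: float, last_heartbeat: Vector, threshold: float = 300) -> Vector:
    """Model of: count(<stale_series>) > 0.

    count() without a grouping clause drops all labels and returns the
    number of matching series, so $value is a count, not seconds.
    """
    n = len(stale_series(now, last_heartbeat, threshold))
    return {(): float(n)} if n > 0 else {}


def new_expr(now: float, last_heartbeat: Vector, threshold: float = 600) -> Vector:
    """Model of the fixed expression: one result per unhealthy sidecar."""
    return stale_series(now, last_heartbeat, threshold)


if __name__ == "__main__":
    heartbeats: Vector = {
        (("job", "thanos-sidecar"), ("pod", "thanos-sidecar-pod-0")): 0.0,
        (("job", "thanos-sidecar"), ("pod", "thanos-sidecar-pod-1")): 0.0,
    }
    print(old_expr(600, heartbeats))  # single label-less sample, value = count
    print(new_expr(600, heartbeats))  # one sample per pod, value = age in seconds
```

In this model the old expression yields a single sample with no `job` or `pod` labels and a value of 2 (the number of stale sidecars), which is why the annotation template rendered empty labels and "2 seconds". The new expression keeps the labels and reports the heartbeat age, so each unhealthy sidecar raises its own alert.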