mixin: Fix alert about unhealthy sidecar (#2929)
The alert was giving the wrong information, as $value contained
the number of pods that were failing to send a heartbeat instead of the actual
number of seconds that each sidecar had been unhealthy.

Also, the 5-minute interval is probably too low, as on large deployments
Prometheus could take much longer to come online and for the sidecar to
become actually useful.

As such, we can simply subtract the timestamp of the last heartbeat from
the current time and fire if we are lagging for more than 10 minutes.

Signed-off-by: Markos Chandras <markos@chandras.me>
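
To illustrate the difference described in the commit message, here is a minimal Python sketch (not Prometheus code) that models both expressions. The function names and the sample heartbeat data are assumptions made for illustration, and the `max(...) by (job, pod)` aggregation is modeled simply as a dictionary already keyed by `(job, pod)`.

```python
# Hypothetical model of the two alert expressions; names and data are illustrative only.

def old_expr(now, last_heartbeat, threshold=300):
    """Models: count(time() - max(...) by (job, pod) >= 300) > 0"""
    lagging = [now - ts for ts in last_heartbeat.values() if now - ts >= threshold]
    # count() collapses all series into one unlabeled sample whose value is
    # the number of lagging pods, so the job/pod labels and the lag are lost.
    return [({}, len(lagging))] if lagging else []

def new_expr(now, last_heartbeat, threshold=600):
    """Models: time() - max(...) by (job, pod) >= 600"""
    # One sample per (job, pod); the value is the lag in seconds.
    return [
        ({"job": job, "pod": pod}, now - ts)
        for (job, pod), ts in last_heartbeat.items()
        if now - ts >= threshold
    ]

# Two sidecars whose last successful heartbeat was at t=0, evaluated at t=600.
heartbeats = {
    ("thanos-sidecar", "thanos-sidecar-pod-0"): 0,
    ("thanos-sidecar", "thanos-sidecar-pod-1"): 0,
}
print(old_expr(600, heartbeats))  # one sample, value 2 (a pod count)
print(new_expr(600, heartbeats))  # two samples, value 600 each (seconds)
```

Under these assumptions the old form yields a single value of 2, which is why the old test expected "unhealthy for 2 seconds", while the new form yields 600 per pod with the `job` and `pod` labels preserved for the annotation template.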
hwoarang committed Aug 12, 2020
1 parent 7f0364d commit d6305f5
Showing 5 changed files with 59 additions and 34 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -27,6 +27,7 @@ We use *breaking* word for marking changes that are not backward compatible (rel
  - [#2970](https://github.com/thanos-io/thanos/pull/2970) Store: Upgrade minio-go/v7 to fix slowness when running on EKS.
  - [#2957](https://github.com/thanos-io/thanos/pull/2957) Rule: now sets all of the relevant fields properly; avoids a panic when `/api/v1/rules` is called and the time zone is _not_ UTC; `rules` field is an empty array now if no rules have been defined in a rule group.
  - [#2976](https://github.com/thanos-io/thanos/pull/2976) Query: Better rounding for incoming query timestamps.
+ - [#2929](https://github.com/thanos-io/thanos/pull/2929) Mixin: Fix expression for 'unhealthy sidecar' alert and also increase the timeout for 10 minutes.

  ### Added

2 changes: 1 addition & 1 deletion examples/alerts/alerts.md
@@ -275,7 +275,7 @@ rules:
  message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value
  }} seconds.
  expr: |
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 600
  labels:
  severity: critical
  ```
2 changes: 1 addition & 1 deletion examples/alerts/alerts.yaml
@@ -258,7 +258,7 @@ groups:
  message: Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{
  $value }} seconds.
  expr: |
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job=~"thanos-sidecar.*"}) by (job, pod) >= 600
  labels:
  severity: critical
  - name: thanos-store.rules
86 changes: 55 additions & 31 deletions examples/alerts/tests.yaml
@@ -22,47 +22,47 @@ tests:
  exp_samples:
  - labels: '{}'
  value: 120
- - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
+ - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
  eval_time: 2m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
  value: 43
- - labels: '{pod="thanos-sidecar-pod-1"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
  value: 42
- - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 5m
+ - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
+ eval_time: 10m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
  value: 0
- - labels: '{pod="thanos-sidecar-pod-1"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
  value: 0
- - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 6m
+ - expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
+ eval_time: 11m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
  value: 0
- - labels: '{pod="thanos-sidecar-pod-1"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
  value: 0
- - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 5m
+ - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
+ eval_time: 10m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
- value: 300
- - labels: '{pod="thanos-sidecar-pod-1"}'
- value: 300
- - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod)
- eval_time: 6m
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
+ value: 600
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
+ value: 600
+ - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod)
+ eval_time: 11m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
- value: 360
- - labels: '{pod="thanos-sidecar-pod-1"}'
- value: 360
- - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (pod) >= 300
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
+ value: 660
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
+ value: 660
+ - expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, pod) >= 600
  eval_time: 12m
  exp_samples:
- - labels: '{pod="thanos-sidecar-pod-0"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-0"}'
  value: 720
- - labels: '{pod="thanos-sidecar-pod-1"}'
+ - labels: '{job="thanos-sidecar", pod="thanos-sidecar-pod-1"}'
  value: 720
  alert_rule_test:
  - eval_time: 1m
@@ -71,24 +71,48 @@ tests:
  alertname: ThanosSidecarUnhealthy
  - eval_time: 3m
  alertname: ThanosSidecarUnhealthy
- - eval_time: 5m
+ - eval_time: 10m
  alertname: ThanosSidecarUnhealthy
  exp_alerts:
  - exp_labels:
  severity: critical
+ job: thanos-sidecar
+ pod: thanos-sidecar-pod-0
  exp_annotations:
- message: 'Thanos Sidecar is unhealthy for 2 seconds.'
- - eval_time: 6m
+ message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 600 seconds.'
+ - exp_labels:
+ severity: critical
+ job: thanos-sidecar
+ pod: thanos-sidecar-pod-1
+ exp_annotations:
+ message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 600 seconds.'
+ - eval_time: 11m
  alertname: ThanosSidecarUnhealthy
  exp_alerts:
  - exp_labels:
  severity: critical
+ job: thanos-sidecar
+ pod: thanos-sidecar-pod-0
  exp_annotations:
- message: 'Thanos Sidecar is unhealthy for 2 seconds.'
+ message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 660 seconds.'
+ - exp_labels:
+ severity: critical
+ job: thanos-sidecar
+ pod: thanos-sidecar-pod-1
+ exp_annotations:
+ message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 660 seconds.'
  - eval_time: 12m
  alertname: ThanosSidecarUnhealthy
  exp_alerts:
  - exp_labels:
  severity: critical
+ job: thanos-sidecar
+ pod: thanos-sidecar-pod-0
+ exp_annotations:
+ message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-0 is unhealthy for 720 seconds.'
+ - exp_labels:
+ severity: critical
+ job: thanos-sidecar
+ pod: thanos-sidecar-pod-1
  exp_annotations:
- message: 'Thanos Sidecar is unhealthy for 2 seconds.'
+ message: 'Thanos Sidecar thanos-sidecar thanos-sidecar-pod-1 is unhealthy for 720 seconds.'
2 changes: 1 addition & 1 deletion mixin/alerts/sidecar.libsonnet
@@ -27,7 +27,7 @@
  message: 'Thanos Sidecar {{$labels.job}} {{$labels.pod}} is unhealthy for {{ $value }} seconds.',
  },
  expr: |||
- count(time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) by (job, pod) >= 300) > 0
+ time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) by (job, pod) >= 600
  ||| % thanos.sidecar,
  labels: {
  severity: 'critical',
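
The `%(selector)s` placeholder in the mixin is filled in by Jsonnet's `%` operator, which follows Python-style string formatting. As a rough sketch of how the template turns into the rendered rule in examples/alerts/alerts.yaml, the same substitution can be reproduced in Python. The `sidecar` dictionary below is a stand-in for `thanos.sidecar`, and its `selector` value is an assumption inferred from the rendered expression, not copied from the mixin's config.

```python
# Stand-in for the mixin's thanos.sidecar object (value inferred, not authoritative).
sidecar = {"selector": 'job=~"thanos-sidecar.*"'}

# Same template text as the new expr in sidecar.libsonnet.
template = (
    "time() - max(thanos_sidecar_last_heartbeat_success_time_seconds"
    "{%(selector)s}) by (job, pod) >= 600"
)

# Named-placeholder substitution; literal braces pass through unchanged.
rendered = template % sidecar
print(rendered)
```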