Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mixin: Fix alert about unhealthy sidecar #2929

Merged
merged 1 commit into from
Aug 12, 2020

Conversation

hwoarang
Copy link
Contributor

@hwoarang hwoarang commented Jul 23, 2020

The alert was giving the wrong information as the $value contained
the number of pods that failing to send heartbeat instead of the actual
number of seconds that each sidecar was being unhealthy.

Also the 5 minute interval is probably too low as on large deployments
prometheus could take much longer to come up online and for sidecar to
become actually useful.

As such, we can simply subtract the timestamp of the last heartbeat from
the current time and fire if we are lagging for more than 10 minutes.

This is also consistent with the current unit tests.

  • I added CHANGELOG entry for this change.

Changes

Improve alert for sidecar to provide correct information and not fire too soon

Verification

Deployed locally and did a prometheus update and it did not fire anymore

@hwoarang hwoarang force-pushed the fix-alert-for-sidecar branch 7 times, most recently from 6c4a018 to 6678f08 Compare July 28, 2020 14:10
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 🥇 It needs rebasing otherwise we can merge

@hwoarang
Copy link
Contributor Author

hwoarang commented Aug 4, 2020

lgtm 1st_place_medal It needs rebasing otherwise we can merge

@kakkoyun thank you for the approval. I have just rebased it

Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized a minor thing, could have another look at it?

examples/alerts/alerts.md Show resolved Hide resolved
examples/alerts/alerts.md Outdated Show resolved Hide resolved
@kakkoyun
Copy link
Member

@hwoarang Friendly ping.

@hwoarang
Copy link
Contributor Author

@kakkoyun apologies for the delay but due to holidays, connectivity and time are at a premium :) nevertheless I will try to get to it as soon as possible.

@hwoarang hwoarang force-pushed the fix-alert-for-sidecar branch 2 times, most recently from c82af6d to bb529df Compare August 12, 2020 13:56
@hwoarang hwoarang requested a review from kakkoyun August 12, 2020 13:57
@hwoarang
Copy link
Contributor Author

@kakkoyun I believe I have addressed all your concerns now

The alert was giving the wrong information as the $value contained
the number of pods that failing to send heartbeat instead of the actual
number of seconds that each sidecar was being unhealthy.

Also the 5 minute interval is probably too low as on large deployments
prometheus could take much longer to come up online and for sidecar to
become actually useful.

As such, we can simply subtract the timestamp of the last heartbeat from
the current time and fire if we are lagging for more than 10 minutes.

Signed-off-by: Markos Chandras <markos@chandras.me>
Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@kakkoyun
Copy link
Member

@hwoarang Thanks a lot 🙏

@kakkoyun kakkoyun merged commit d6305f5 into thanos-io:master Aug 12, 2020
@hwoarang hwoarang deleted the fix-alert-for-sidecar branch August 12, 2020 15:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants