Unhealthy Compactors Stay in the Ring #142

Closed
joe-elliott opened this issue Aug 21, 2020 · 12 comments · Fixed by #878

Comments

@joe-elliott
Member

I've seen an unhealthy compactor stay in the ring for hours after it was gone. Research this and see if it's a matter of configuration or actually a bug of some kind.

Should we use a simpler discovery mechanism like DNS?

@joe-elliott
Member Author

joe-elliott commented Sep 16, 2020

This appears to happen occasionally on rollout. Also, even after manually forgetting them, they come back. Perhaps due to gossip?

@joe-elliott
Member Author

joe-elliott commented Nov 3, 2020

If you are seeing this issue and are unable to successfully forget a compactor, it is recommended to click the "Forget" button, wait a full 10 seconds, stand up, stretch, get all your grocery shopping done, come back and then hit F5. The compactor should be forgotten.

If you quickly spam "Forget" then old compactors seem to stay in the ring. This is believed to be an issue with the memberlist propagation of the ring.
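
To illustrate why a forgotten compactor could come back through gossip, here is a toy model in Python. It is a sketch under stated assumptions and not the Cortex or memberlist implementation: the `Entry` type, the merge rule, and all function names are hypothetical. It contrasts deleting an entry outright, which a peer with stale state can undo, with leaving a tombstone that wins the merge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical ring entry; not the real Cortex ring descriptor."""
    heartbeat: int       # last heartbeat timestamp seen for the instance
    left: bool = False   # tombstone flag: instance was explicitly forgotten


def merge(local: dict, remote: dict) -> dict:
    """Assumed merge rule: for each instance keep the newer heartbeat."""
    merged = dict(local)
    for name, entry in remote.items():
        if name not in merged or entry.heartbeat > merged[name].heartbeat:
            merged[name] = entry
    return merged


def forget_by_delete(ring: dict, name: str) -> dict:
    """Forget by dropping the entry; nothing records that it was removed."""
    return {k: v for k, v in ring.items() if k != name}


def forget_with_tombstone(ring: dict, name: str, now: int) -> dict:
    """Forget by writing a newer entry that is marked as left."""
    updated = dict(ring)
    updated[name] = Entry(heartbeat=now, left=True)
    return updated


def live_members(ring: dict) -> list:
    """Instances that have not been tombstoned, sorted by name."""
    return sorted(name for name, entry in ring.items() if not entry.left)


# Both peers start with the same view; compactor-2 is already gone.
stale_peer = {"compactor-1": Entry(100), "compactor-2": Entry(90)}

# Deleting locally is undone as soon as the stale peer's state is merged in.
deleted = forget_by_delete(stale_peer, "compactor-2")
print(live_members(merge(deleted, stale_peer)))

# A tombstone carries a newer timestamp, so it survives the merge both ways.
tombstoned = forget_with_tombstone(stale_peer, "compactor-2", now=200)
print(live_members(merge(tombstoned, stale_peer)))
print(live_members(merge(stale_peer, tombstoned)))
```

In this model the first merge brings `compactor-2` back, while the tombstoned variant keeps it out regardless of which peer merges into which. Whether this matches the actual failure mode in the ring is an assumption, not something confirmed in this thread.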

@annanay25
Contributor

The Forget behaviour may be fixed by cortexproject/cortex#3603; the fix will take effect once we vendor in the latest Cortex version.

Research a way to not care about a compactor disappearing from the ring.

@gouthamve
Member

Can this be closed?

@joe-elliott
Member Author

We believe that #442 fixed this issue, but we have not seen it again in our internal cluster to confirm. I'd rather keep this open until we verify.

@slim-bean

This happened again. We could not get the unhealthy compactor to leave, and ended up port-forwarding 4-5 compactors between two people and clicking Forget a lot. Eventually it went away.

@joe-elliott
Member Author

@pstibrany has reported he feels the issue will be fixed in Cortex 1.7.0. We will keep an eye on it after the upgrade.

@pstibrany
Member

It's the same cortexproject/cortex#3603 fix, but Tempo currently doesn't use a Cortex version that includes it.

@joe-elliott
Member Author

Confirmed fixed in our environment by #512

Thanks @pstibrany!

@joe-elliott
Member Author

We've seen this again, but found a way to mitigate it. The changes in Cortex have certainly made it easier to deal with, but it does still happen occasionally.

Details have been added to the appropriate runbook entries: #532

@joe-elliott joe-elliott reopened this Feb 17, 2021
@mdisibio mdisibio added this to the kv store complete milestone Feb 25, 2021
@joe-elliott
Member Author

joe-elliott commented Mar 9, 2021

Further updates on this. We have since switched to using these values in our memberlist config and have not been able to trigger this issue since:

memberlist:
    left_ingesters_timeout: 30m
    pull_push_interval: 15s

Still keeping an eye on things.

@joe-elliott
Member Author

joe-elliott commented Aug 13, 2021

Possible fixes going into Cortex now:

cortexproject/cortex#4420
cortexproject/cortex#4419

TODO:

  • revendor Cortex with these changes and confirm it fixes the issue
  • find sensible defaults for memberlist now that propagation is reduced
  • update runbooks to remove mention of this issue.
  • close this issue!!!
