
ingesters stuck in LEAVING state when increasing their number of tokens #3801

Open

agardiman opened this issue Dec 22, 2022 · 2 comments

@agardiman

Describe the bug

When I change the CLI option -ingester.ring.num-tokens, the pods leave the ring in the LEAVING state. When they restart, they detect that their entry is already present in the ring and get stuck in that state, never reaching ACTIVE.
This happens when the ingesters are configured NOT to deregister from the ring at shutdown.

To Reproduce

Steps to reproduce the behavior:

  1. Start (Mimir 2.4.0) ingesters with -ingester.ring.num-tokens=64
  2. Change -ingester.ring.num-tokens to something bigger, even just 65
  3. The restarted ingesters never come up as ACTIVE

Expected behavior

The new ingesters should detect that they already have some tokens locally but need more, generate the missing tokens, insert them into the ring, reach the ACTIVE state, and start ingesting metrics.
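
For illustration, a minimal, self-contained Go sketch of that token top-up behaviour, assuming tokens are plain uint32 values; topUpTokens and everything around it are hypothetical names, not the actual dskit lifecycler code:

// Sketch of the desired behaviour: keep the tokens the ingester already owns
// and only generate the ones that are missing.
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// topUpTokens keeps the existing tokens and generates random new ones until
// the desired count is reached, skipping tokens already taken by other
// ingesters in the ring.
func topUpTokens(existing []uint32, takenInRing map[uint32]bool, desired int) []uint32 {
	used := make(map[uint32]bool, len(existing)+len(takenInRing))
	for t := range takenInRing {
		used[t] = true
	}
	for _, t := range existing {
		used[t] = true
	}

	tokens := append([]uint32(nil), existing...)
	for len(tokens) < desired {
		candidate := rand.Uint32()
		if used[candidate] {
			continue // avoid colliding with a token already in the ring
		}
		used[candidate] = true
		tokens = append(tokens, candidate)
	}

	sort.Slice(tokens, func(i, j int) bool { return tokens[i] < tokens[j] })
	return tokens
}

func main() {
	old := []uint32{100, 200, 300} // pretend these were loaded from the tokens file
	tokens := topUpTokens(old, map[uint32]bool{400: true}, 5)
	fmt.Println(tokens) // the original 3 tokens plus 2 freshly generated ones
}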

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet
  • Mimir version: 2.4.0

Additional Context

The following are the logs from an affected instance:

➜  ~ kubectl logs -f ingester-zone-a-0
level=info ts=2022-12-21T10:38:44.197631625Z caller=main.go:210 msg="Starting application" version="(version=2.4.0, branch=HEAD, revision=32137ee)"
level=info ts=2022-12-21T10:38:44.198422782Z caller=server.go:306 http=[::]:80 grpc=[::]:9095 msg="server listening on addresses"
...
level=info ts=2022-12-21T10:38:44.204627871Z caller=memberlist_client.go:436 msg="Using memberlist cluster label and node name" cluster_label= node=ingester-zone-a-0-1abc387f
...
level=info ts=2022-12-21T10:38:44.211937824Z caller=memberlist_client.go:543 msg="memberlist fast-join starting" nodes_found=9 to_join=4
level=info ts=2022-12-21T10:38:44.219120657Z caller=memberlist_client.go:563 msg="memberlist fast-join finished" joined_nodes=4 elapsed_time=13.697477ms
level=info ts=2022-12-21T10:38:44.219183764Z caller=memberlist_client.go:576 msg="joining memberlist cluster" join_members=dns+gossip-ring.cortex.svc.cluster.local:7946
level=info ts=2022-12-21T10:38:44.236074303Z caller=memberlist_client.go:595 msg="joining memberlist cluster succeeded" reached_nodes=9 elapsed_time=16.890905ms
...
level=info ts=2022-12-21T10:38:44.400459071Z caller=module_service.go:82 msg=initialising module=memberlist-kv
...
level=info ts=2022-12-21T10:38:44.523936265Z caller=mimir.go:762 msg="Application started"
level=info ts=2022-12-21T10:38:44.525500762Z caller=lifecycler.go:612 msg="existing entry found in ring" state=LEAVING tokens=64 ring=ingester

Just as an experiment, if I instead set the option to unregister at shutdown, the instance unregisters from the ring, then finds the old 64 tokens on the file system, adds the missing tokens (leaving the old 64 untouched), and registers again with the old 64 tokens plus the new ones.

@pracucci
Collaborator

I think you've hit this issue:
grafana/dskit#73

If so, it's a known issue. There was some work on it a while ago in grafana/dskit#79, but we haven't had time to follow up on it yet.

@agardiman
Author

It does seem to be the same issue indeed! I left some comments on the PR. If no one is working on it, I'm happy to help.
