-
Notifications
You must be signed in to change notification settings - Fork 579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[admin] Fix self leadership transfer #3446
[admin] Fix self leadership transfer #3446
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
What's the rationale for returning any error when transferring to self? if a natural leadership movement race results in the target becoming the leader then it's hard to imagine any error handling being anything other than ignoring the error. |
I think this was my suggestion when chatting with Vadim about this earlier, it's totally nonessential though, could go either way. I think I was thinking that internally our calls aren't generally idempotent, but externally they are. |
@VadimPlh I think you can merge this now as-is, or change it to return success. Erroring on the side of more information is usually a good tie breakers I suppose. This is an internal interface, so it isn't necessarily a binding decision. |
Cover letter
Problem
I got in
raft_availability_test.RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1
error:@jcsp investigated to this error and found:
the new leader is node 1, the intended destination. So somehow, the leadership transfer is being done, but the client is receiving the wrong status code.
That old leader is stepping down while in the middle of doing the leadership transfer
`INFO 2022-01-10 11:47:00,964 [sh## Cover letter
Problem
I got in
raft_availability_test.RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1
error:@jcsp investigated to this error and found:
the new leader is node 1, the intended destination. So somehow, the leadership transfer is being done, but the client is receiving the wrong status code.
That old leader is stepping down while in the middle of doing the leadership transfer
INFO 2022-01-10 11:47:00,964 [shard 1] raft - [group_id:1, {kafka/topic-xvvoavvzqf/0}] consensus.cc:134 - Stepping down as leader in term 15, dirty offset 91754
Then subsequently, the client retries, and is getting redirected to the new leader (good) but then the new leader is also responding 503, because of this in consensus.cc
The admin API interprets not_leader as a 503. In this case, not_leader isn't really the right error, we should be returning some kind of "no op" error internally and a 200 to the client (if they ask to transfer leadership to the current leader, that should just be a success)
Solution:
Return 200 in situation when user try to transfer leadership to current