
[admin] Fix self leadership transfer #3446

Merged: 3 commits into redpanda-data:dev on Jan 12, 2022

Conversation

@VadimPlh (Contributor) commented on Jan 11, 2022

Cover letter

Problem

I got the following error in raft_availability_test.RaftAvailabilityTest.test_leader_transfers_recovery.acks=-1:

Max retries exceeded with url: /v1/partitions/kafka/topic-xvvoavvzqf/0/transfer_leadership?target=1 (Caused by ResponseError('too many 503 error responses')) 

@jcsp investigated this error and found:

the new leader is node 1, the intended destination. So somehow, the leadership transfer is being done, but the client is receiving the wrong status code.

The old leader is stepping down in the middle of the leadership transfer:
INFO 2022-01-10 11:47:00,964 [shard 1] raft - [group_id:1, {kafka/topic-xvvoavvzqf/0}] consensus.cc:134 - Stepping down as leader in term 15, dirty offset 91754

Subsequently, the client retries and is redirected to the new leader (good), but the new leader also responds with 503, because of this check in consensus.cc:

    if (*target == _self.id()) {
        // The current leader was asked to transfer leadership to itself;
        // it reports not_leader, which the admin API turns into a 503.
        vlog(_ctxlog.warn, "Cannot transfer leadership to self");
        return seastar::make_ready_future<std::error_code>(
          make_error_code(errc::not_leader));
    }

The admin API interprets not_leader as a 503. In this case, not_leader isn't really the right error: we should return some kind of "no-op" indication internally and a 200 to the client. If they ask to transfer leadership to the current leader, that should just be a success.
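As a rough illustration of that mapping (a hypothetical sketch, not the actual admin_server code; the handler shape, the ss alias for seastar, and the transfer_leadership call are all assumed here):

    // Hypothetical admin handler fragment: any not_leader result is
    // surfaced as HTTP 503, which the test's HTTP client treats as
    // retryable, hence the "too many 503 error responses" failure.
    std::error_code ec = co_await transfer_leadership(target);
    if (ec == raft::errc::not_leader) {
        throw ss::httpd::base_exception(
          "not the leader",
          ss::httpd::reply::status_type::service_unavailable);
    }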

Solution:

Return 200 when the user asks to transfer leadership to the node that is already the leader.
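A minimal sketch of the change at the same spot in consensus.cc as the snippet above (illustrative only; the exact errc value and log level in the merged commit may differ):

    if (*target == _self.id()) {
        // Transferring leadership to the node that is already the leader
        // is a no-op; report success so the admin API answers 200, not 503.
        vlog(_ctxlog.debug, "Ignoring leadership transfer to self");
        return seastar::make_ready_future<std::error_code>(
          make_error_code(errc::success));
    }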

@mmaslankaprv (Member) left a comment

lgtm

@dotnwat (Member) commented on Jan 11, 2022

What's the rationale for returning any error when transferring to self? If a natural leadership-movement race results in the target becoming the leader, it's hard to imagine any error handling being anything other than ignoring the error.

@jcsp (Contributor) commented on Jan 11, 2022

> What's the rationale for returning any error when transferring to self?

I think this was my suggestion when chatting with Vadim about this earlier; it's totally nonessential, though, and could go either way. I was thinking that internally our calls aren't generally idempotent, but externally they are.

@dotnwat (Member) commented on Jan 11, 2022

@VadimPlh I think you can merge this now as-is, or change it to return success. Erring on the side of more information is usually a good tie breaker, I suppose. This is an internal interface, so it isn't necessarily a binding decision.

VadimPlh merged commit b0a24b4 into redpanda-data:dev on Jan 12, 2022
mmaslankaprv added a commit that referenced this pull request on Feb 18, 2022