c/metadata_dissemination: fix applying updates from health #3487

Merged
merged 2 commits into from
Jan 17, 2022

jcsp
Contributor

@jcsp jcsp commented Jan 14, 2022

Cover letter

c/metadata_dissemination: fix applying updates from health

Node health reports may disagree with one another about
leadership in a particular term if some of them claim
that the leader is null (because the reporting node has seen
the term in its own log after a restart, but has not yet
received an append_entries from the leader).

To avoid a rogue node health report resetting the leadership
of a topic to null, ignore health report leadership
information if it claims a null leader.

Non-null claims are always believable, because of the term:
if they're out of date, then they were still correct for
the term they claim, and we ignore those out-of-date
terms in partition_leaders_table::update_partition_leader.
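
The rule described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `leaders_table_sketch`, `apply_health_report_entry`, `update`, and `leader_of` are hypothetical names, and the types are simplified stand-ins for the real ones. It only assumes what the description states: null-leader claims from health reports are skipped, and updates for terms older than the recorded one are ignored.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins; not the actual Redpanda types.
using node_id = std::int32_t;
using term_id = std::int64_t;

struct leader_info {
    term_id term;
    std::optional<node_id> leader;
};

class leaders_table_sketch {
public:
    // Loosely mirrors the idea behind update_partition_leader:
    // an update carrying a term older than the recorded one is ignored.
    void update(
      const std::string& ntp, term_id term, std::optional<node_id> leader) {
        auto it = _leaders.find(ntp);
        if (it != _leaders.end() && term < it->second.term) {
            return; // out-of-date term
        }
        _leaders[ntp] = leader_info{term, leader};
    }

    // Apply one leadership claim taken from a node health report.
    // A null claim is skipped: the reporter may simply not have heard
    // from the leader yet, so it must not reset a known leader to null.
    void apply_health_report_entry(
      const std::string& ntp, term_id term, std::optional<node_id> leader) {
        if (!leader.has_value()) {
            return;
        }
        update(ntp, term, leader);
    }

    // Returns the recorded leader, or nullopt if none is known.
    std::optional<node_id> leader_of(const std::string& ntp) const {
        auto it = _leaders.find(ntp);
        if (it == _leaders.end()) {
            return std::nullopt;
        }
        return it->second.leader;
    }

private:
    std::map<std::string, leader_info> _leaders;
};
```

In this sketch, a health report claiming a null leader for a term the table already knows leaves the recorded leader untouched, a non-null claim for an older term is dropped by the term check, and a non-null claim for a newer term replaces the entry.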

Fixes #3486

Release notes

Improvements

  • Fix an issue where nodes may have stale leadership metadata for a short period after a node restarts

This regressed in c8f4f12

@jcsp jcsp added the kind/bug and area/controller labels Jan 14, 2022
@jcsp jcsp changed the title from "Fix issue 3486 null leaders" to "c/metadata_dissemination: fix applying updates from health" Jan 14, 2022
Member

@mmaslankaprv mmaslankaprv left a comment

lgtm

@jcsp
Contributor Author

jcsp commented Jan 14, 2022

Debug build failure was #3384, retrying it

@jcsp jcsp merged commit 315fdd7 into redpanda-data:dev Jan 17, 2022
@jcsp jcsp deleted the fix-issue-3486-null-leaders branch January 17, 2022 10:13
mmaslankaprv added a commit that referenced this pull request Jan 28, 2022
This pull request was closed.
Successfully merging this pull request may close these issues.

cluster: Stale leadership in partition_leaders_table on leader (failure in ClusterConfigTest.test_restart)
2 participants