cluster: Stale leadership in partition_leaders_table on leader (failure in `ClusterConfigTest.test_restart`) #3486

jcsp · 2022-01-14T11:44:30Z

This manifests as a failure of ClusterConfigTest.test_restart, because that test checks for convergence of config versions, and config versions are only updated if nodes can see a controller leader. It may be destabilizing other tests too, if they have timeouts that rely on a controller leader being available within a certain time.

https://buildkite.com/vectorized/redpanda/builds/6142#3eb5ad1e-c519-4c16-b39c-5355ff4cf590

After an election has succeeded, the metadata dissemination service's ticker still uses the content of node health reports to set leadership. If the last node in the list of node health reports says leader=null, this continuously overrides the local partition leader table until the next round of health reports comes in.
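To make the failure mode concrete, here is a toy model in Python of reports being applied in list order. It is a sketch only, not the Redpanda code: `LeaderClaim`, `apply_reports_naively`, and the assumption that a same-term claim overwrites the current entry are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LeaderClaim:
    """One node's view of a partition's leadership (hypothetical shape)."""
    term: int
    leader: Optional[int]  # None models leader=null


def apply_reports_naively(
    table: Dict[str, LeaderClaim],
    reports: List[Dict[str, LeaderClaim]],
) -> None:
    """Apply every claim from every report, in list order.

    Assumption for this sketch: claims from older terms are dropped,
    but a claim for the current term always overwrites the entry,
    even when it says leader=None.
    """
    for report in reports:
        for ntp, claim in report.items():
            current = table.get(ntp)
            if current is None or claim.term >= current.term:
                table[ntp] = claim


# The local table already knows the leader elected in term 5.
ntp = "redpanda/controller/0"  # illustrative partition name
table = {ntp: LeaderClaim(term=5, leader=1)}

reports = [
    {ntp: LeaderClaim(term=5, leader=1)},     # node that has heard from the leader
    {ntp: LeaderClaim(term=5, leader=None)},  # restarted node, no append_entries yet
]
apply_reports_naively(table, reports)
print(table[ntp])  # the last report wins, so leadership is reset to None
```

Because the same cached reports are re-applied on every tick in this model, the null entry keeps winning until fresher reports arrive.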

This behavior was introduced in #3355:


commit c8f4f12ae88dafab7f26b1e99e4711b3fa39642f
Author: Michal Maslanka <michal@vectorized.io>
Date:   Wed Jan 5 13:18:31 2022 +0100

    c/dissemination: use health manager information to update leaders


This regressed in c8f4f12

Node health reports may disagree with one another about leadership in a particular term, if some of them claim that it's null (because they've seen the term in their own log after restart, but not yet received an append_entries from the leader).

To avoid a rogue node health report resetting the leadership of a topic to null, ignore health report leadership information if it claims a null leader. Non-null claims are always believable, because of the term: if they're out of date, then they were still correct for the term they claim, and we ignore those out of date terms in partition_leaders_table::update_partition_leader.

Fixes redpanda-data#3486
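The rule described in this commit message can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the actual change: `LeaderClaim` and `apply_reports_ignoring_null` are hypothetical names, and the term check only stands in for the filtering that the message attributes to partition_leaders_table::update_partition_leader.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LeaderClaim:
    """One node's view of a partition's leadership (hypothetical shape)."""
    term: int
    leader: Optional[int]  # None models leader=null


def apply_reports_ignoring_null(
    table: Dict[str, LeaderClaim],
    reports: List[Dict[str, LeaderClaim]],
) -> None:
    """Apply health report claims, skipping any that claim a null leader."""
    for report in reports:
        for ntp, claim in report.items():
            if claim.leader is None:
                # A null claim may only mean "I have not heard from the
                # leader yet", so it must not erase a known leader.
                continue
            current = table.get(ntp)
            if current is not None and claim.term < current.term:
                # Out-of-date terms are ignored (stand-in for the check
                # the commit message places in update_partition_leader).
                continue
            table[ntp] = claim


ntp = "redpanda/controller/0"  # illustrative partition name
table = {ntp: LeaderClaim(term=5, leader=1)}
apply_reports_ignoring_null(
    table,
    [
        {ntp: LeaderClaim(term=5, leader=1)},
        {ntp: LeaderClaim(term=5, leader=None)},  # ignored
        {ntp: LeaderClaim(term=4, leader=2)},     # stale term, ignored
    ],
)
print(table[ntp])  # the term 5 leader is retained
```

With this rule, a non-null claim from an older term is harmless, because it is filtered by term rather than trusted blindly.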

gousteris · 2022-01-17T11:58:41Z

@jcsp seen again https://buildkite.com/vectorized/redpanda/builds/6240#f98e6822-ff17-4ad9-b8d8-5d8a7382b4c5 should we re-open it?

jcsp · 2022-01-17T12:14:16Z

@gousteris no, that failure is from before this merged.

This regressed in c8f4f12

Node health reports may disagree with one another about leadership in a particular term, if some of them claim that it's null (because they've seen the term in their own log after restart, but not yet received an append_entries from the leader).

To avoid a rogue node health report resetting the leadership of a topic to null, ignore health report leadership information if it claims a null leader. Non-null claims are always believable, because of the term: if they're out of date, then they were still correct for the term they claim, and we ignore those out of date terms in partition_leaders_table::update_partition_leader.

Fixes redpanda-data#3486

(cherry picked from commit 1335dff)

jcsp added kind/bug area/controller labels Jan 14, 2022

jcsp mentioned this issue Jan 14, 2022

c/metadata_dissemination: fix applying updates from health #3487

Merged

jcsp self-assigned this Jan 14, 2022

mmaslankaprv mentioned this issue Jan 14, 2022

Failure in raft_availability_test.RaftAvailabilityTest.test_leader_restart #3468

Open

jcsp closed this as completed in #3487 Jan 17, 2022

jcsp mentioned this issue Jan 17, 2022

kafka: support setting broker properties via kafka APIs #3267

Merged

jcsp added the ci-failure label Jan 17, 2022

This issue was closed.
