fix: clear preferredReadReplica if broker shutdown #2108

Merged
dnwe merged 2 commits into main from dnwe/fix-consumer-from-follower
Jan 12, 2022

dnwe
Collaborator

@dnwe dnwe commented Jan 12, 2022

Once Sarama had been given a preferred replica to consume from, it mistakenly latched that value and did not unset it when the preferred replica broker was shut down and dropped out of the cluster metadata.

Fetches continued to work for as long as that broker remained shut down, because they were now being sent to the Leader, which serviced them itself as it had no better preferred replica to point the client at.

However, consumption would hang once the broker came back online. At that point the Leader stopped returning records in the FetchResponse and returned only the preferred replicaID, expecting the client to send its FetchRequests there. Because the partitionConsumer had latched the value of preferredReplica, it never dispatched to (re-)connect to the preferred replica; it just kept sending FetchRequests to the leader and received no records back.
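
For readers who want to see the sequence in isolation, here is a minimal toy model of it. This is not Sarama's actual code: the `cluster` and `consumer` types, the `noPreferredReplica` sentinel, the `clearOnMissing` switch and the `simulate` helper are all hypothetical names made up for illustration, and the rule that the consumer only re-dispatches when the leader's hint differs from the value it already holds is an assumption chosen to reproduce the symptom described above.

```go
package main

import "fmt"

// noPreferredReplica is a hypothetical sentinel meaning "fetch from the leader".
const noPreferredReplica int32 = -1

// cluster is a toy stand-in for cluster metadata.
type cluster struct {
	leader int32
	live   map[int32]bool // broker ID -> present in metadata
}

// consumer holds only the state relevant to this bug.
type consumer struct {
	preferredReadReplica int32
	clearOnMissing       bool // true models the behaviour after the fix
}

// pickBroker returns the broker ID the next FetchRequest should go to.
func (c *consumer) pickBroker(cl *cluster) int32 {
	if c.preferredReadReplica != noPreferredReplica {
		if cl.live[c.preferredReadReplica] {
			return c.preferredReadReplica
		}
		if c.clearOnMissing {
			// The fix: forget a replica that has left the metadata.
			c.preferredReadReplica = noPreferredReplica
		}
	}
	return cl.leader
}

// onLeaderHint handles a FetchResponse that only names a preferred replica.
// Assumption: the consumer re-dispatches only when the hint differs from
// the value it already holds.
func (c *consumer) onLeaderHint(replica int32) (redispatch bool) {
	if replica == c.preferredReadReplica {
		return false
	}
	c.preferredReadReplica = replica
	return true
}

// simulate replays the sequence from the description and reports whether
// the consumer moves back to the replica once it has restarted.
func simulate(clearOnMissing bool) bool {
	cl := &cluster{leader: 0, live: map[int32]bool{0: true, 1: true}}
	c := &consumer{preferredReadReplica: 1, clearOnMissing: clearOnMissing}

	cl.live[1] = false   // replica 1 shuts down and leaves the metadata
	_ = c.pickBroker(cl) // fetches fall back to the leader

	cl.live[1] = true        // replica 1 comes back online
	return c.onLeaderHint(1) // leader points the client at replica 1 again
}

func main() {
	fmt.Println("latched value, redispatch:", simulate(false))
	fmt.Println("cleared value, redispatch:", simulate(true))
}
```

With the latched value the model never re-dispatches, which corresponds to the hang; clearing the value while the replica is absent lets the next hint from the leader take effect.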

Contributes-to: #2090

Signed-off-by: Dominic Evans <dominic.evans@uk.ibm.com>
@dnwe dnwe added the fix label Jan 12, 2022
Contributor

@bai bai left a comment

👍🏼 lgtm

@dnwe dnwe merged commit a059adb into main Jan 12, 2022
@dnwe dnwe deleted the dnwe/fix-consumer-from-follower branch January 12, 2022 09:29
@lizthegrey
Contributor

Confirming, this fix solved our problem.
