Fix a rare panic during consumer group rebalance #791

Merged: 2 commits, Dec 6, 2021


stevevls
Contributor

Currently, a sync.WaitGroup is used to track the goroutines started by
Generation.Start. When a generation is ending, the WaitGroup ensures
that all of the Start-ed functions exit before the next generation can
start. As written up in #786, unsafe use of the WaitGroup creates an
edge case that can panic: close() may already be in progress and
waiting on WaitGroup.Wait() when Start() comes in and calls
WaitGroup.Add(1).
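
For illustration, the hazardous pattern can be sketched roughly as follows. This is a simplified stand-in, not the actual consumergroup.go code: the type name unsafeGeneration and the func() signature are assumptions made for the example. The sync package documentation requires that Add calls with a positive delta made while the counter is zero happen before Wait, and nothing in this shape enforces that ordering.

```go
package main

import (
	"fmt"
	"sync"
)

// unsafeGeneration is a hypothetical, simplified stand-in for the pre-fix
// pattern; it is not the real Generation type.
type unsafeGeneration struct {
	wg sync.WaitGroup
}

// Start runs fn in a goroutine and tracks it with the WaitGroup.
func (g *unsafeGeneration) Start(fn func()) {
	// Hazard: nothing prevents this Add from running while close() is
	// already blocked in Wait() on another goroutine.
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		fn()
	}()
}

// close waits for every tracked goroutine to exit.
func (g *unsafeGeneration) close() {
	g.wg.Wait()
}

func main() {
	g := &unsafeGeneration{}
	finished := make(chan struct{})
	g.Start(func() { close(finished) })
	g.close() // safe here only because Start and close run sequentially
	<-finished
	fmt.Println("sequential use completed")
}
```

Sequential use like the main function above is fine; the problem only appears when Start() and close() are called from different goroutines at the same time.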

The contract of Start() is that the provided function is to be run.
Code in the wild may depend on this behavior, so it's not an option to
return from Start() without running the provided function.

Accordingly, this PR updates the code to coordinate between close() and
Start() such that the panic case is no longer possible while preserving
the existing contract. It uses channels and a mutex to distinguish two
cases: the normal case where the generation is alive and the edge case
where the generation has already ended.
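
A minimal sketch of this kind of coordination is shown below. It is an illustration under assumptions, not the merged code: the type generation, its fields (mu, closed, routines, done, joined), and the fn signature taking a done channel are all hypothetical names chosen for the example. The mutex guards a closed flag and a counter, so Start() can tell which of the two cases it is in, and a channel replaces WaitGroup.Wait() for the join.

```go
package main

import (
	"fmt"
	"sync"
)

// generation is a hypothetical stand-in used only for this sketch.
type generation struct {
	mu       sync.Mutex
	closed   bool          // set once the generation has ended
	routines int           // number of tracked goroutines still running
	done     chan struct{} // closed when the generation ends
	joined   chan struct{} // closed when the last tracked goroutine exits after close
}

func newGeneration() *generation {
	return &generation{done: make(chan struct{}), joined: make(chan struct{})}
}

// Start always runs fn, preserving the contract described above.
func (g *generation) Start(fn func(done <-chan struct{})) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		// Edge case: the generation has already ended. Run fn untracked;
		// done is already closed, so a well-behaved fn returns promptly.
		go fn(g.done)
		return
	}
	// Normal case: register the goroutine while holding the lock.
	g.routines++
	go func() {
		fn(g.done)
		g.mu.Lock()
		defer g.mu.Unlock()
		g.routines--
		if g.closed && g.routines == 0 {
			close(g.joined) // release a close() call that is waiting
		}
	}()
}

// close ends the generation and waits for tracked goroutines to exit.
func (g *generation) close() {
	g.mu.Lock()
	if !g.closed {
		g.closed = true
		close(g.done)
	}
	pending := g.routines // read under the lock
	g.mu.Unlock()
	if pending > 0 {
		<-g.joined
	}
}

func main() {
	g := newGeneration()
	results := make(chan string, 2)
	g.Start(func(done <-chan struct{}) { <-done; results <- "tracked function exited" })
	g.close()
	fmt.Println(<-results)
	g.Start(func(done <-chan struct{}) { <-done; results <- "late function still ran" })
	fmt.Println(<-results)
}
```

Because the counter only changes under the mutex and never increases once closed is set, joined is closed at most once, and a late Start() cannot disturb a close() that is already waiting.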

Contributor

@achille-roussel achille-roussel left a comment


Looking good 👍

consumergroup.go (outdated review thread, resolved)
Co-authored-by: Achille <achille@segment.com>
@stevevls stevevls merged commit bc25f16 into main Dec 6, 2021
@stevevls stevevls deleted the svls/alternate-approach-to-786 branch December 6, 2021 17:53
2 participants