
peer: send reestablish, shutdown messages before starting writeHandler #8186

Merged — 1 commit merged into lightningnetwork:master on Nov 16, 2023

Conversation

@Crypt-iQ Crypt-iQ commented Nov 16, 2023

This is to avoid a race on WriteMessage and Flush internals. Because there is no locking on WriteMessage or Flush, if we allow writeMessage calls in Start after the writeHandler has started, those writeMessage calls may invoke WriteMessage/Flush at the same time that writeMessage calls from the writeHandler do. Since there is no locking, internals like b.nextHeaderSend can race and cause panics.

Fixes #8184

peer/brontide.go — outdated review thread, resolved
@morehouse morehouse left a comment


It would be great to have a test for this, to ensure we don't cause a similar breakage in the future.

If that's too difficult, second-best is probably some comments on writeMessage explaining that it should only be used by writeHandler.

peer/brontide.go — outdated review thread, resolved

@yyforyongyu yyforyongyu left a comment


LGTM🙏 and +1 for detailed documentation.

This is to avoid a potential race on WriteMessage and Flush internals.
Because there is no locking on WriteMessage and Flush, if we allow
writeMessage calls in Start after the writeHandler has started,
those writeMessage calls may invoke WriteMessage/Flush at the same
time that writeMessage calls from the writeHandler do. Since there is
no locking, internals like b.nextHeaderSend can race and cause
panics.
@morehouse morehouse left a comment


LGTM

@@ -703,6 +703,23 @@ func (p *Brontide) Start() error {

p.startTime = time.Now()

// Before launching the writeHandler goroutine, we send any channel
Member

If we abstract out loadActiveChannels, or split it into: load chans and load messages, then we can write a unit test here to try to trigger the panic. IIUC, it relies on concurrent access to the brontide state machine, so the two goroutines need to line up directly. This may be easier to trigger with the -race flag on.

Collaborator

I have a unit test that detects the race. Will send a PR once I clean it up.

Collaborator

@Roasbeef
Member

Will make the release branch, merge this in, and add release notes. The goal here is a fast-follow minor release with just this fix; if we can tighten up unit tests here, then that's also ideal. Hopefully we're finally putting this saga to bed and can devote resources to refactoring/rewriting this portion to get away from the mutex issues that started the saga in the first place.

@Roasbeef Roasbeef merged commit 64753b0 into lightningnetwork:master Nov 16, 2023
21 of 25 checks passed
Successfully merging this pull request closes:

[bug]: v0.17.1-beta: panic: runtime error: slice bounds out of range [18:0]