
write: Protect against sending messages after close #476

Open

wants to merge 5 commits into base: master
Conversation

@FrauElster commented Sep 5, 2024

Closes #448

@mafredri (Member) commented Sep 5, 2024

Hey, thanks for the PR. I do think we should take a slightly different approach, though. Instead of piggy-backing on closeMutex in writeClose and potentially introducing deadlocks due to multiple consumers of the mutex, we should move the logic into writeFrame instead. Then we can rely on the already existing writeFrameMu and set a boolean for the close when opcode == opClose.

This has another nice benefit of catching the close message wherever it may originate from.

Edit: Looks like CI failed with a test timeout so this did indeed introduce a deadlock: https://github.com/coder/websocket/actions/runs/10717638649/job/29717749114?pr=476#step:4:97
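To sketch the idea in isolation: a self-contained, illustrative stand-in, not the library's actual internals (frameWriter, errAlreadyClosed and the opcode constants below are made up for this example):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Illustrative stand-ins; not the library's actual identifiers.
var errAlreadyClosed = errors.New("websocket: close frame already sent")

type opcode int

const (
	opText  opcode = 1
	opClose opcode = 8
)

// frameWriter sketches the idea: a single mutex already guards every frame
// write, so the same critical section can record when a close frame goes out.
type frameWriter struct {
	mu        sync.Mutex // plays the role of writeFrameMu
	closeSent bool
}

func (w *frameWriter) writeFrame(op opcode, payload []byte) error {
	w.mu.Lock()
	defer w.mu.Unlock()

	// Reject any frame after a close frame has been sent,
	// no matter where the write originated.
	if w.closeSent {
		return errAlreadyClosed
	}
	if op == opClose {
		w.closeSent = true
	}
	// ... actual frame serialization would happen here ...
	_ = payload
	return nil
}

func main() {
	w := &frameWriter{}
	fmt.Println(w.writeFrame(opClose, nil))         // <nil>
	fmt.Println(w.writeFrame(opText, []byte("hi"))) // close frame already sent
}
```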

@mafredri self-requested a review September 5, 2024 10:18
@mafredri changed the title from "adds a lock on writeClose. closes #448" to "close: Add a lock on writeClose" on Sep 5, 2024
@FrauElster (Author)

Sure, great idea. I just hit this bug and needed it fixed, and in my experience fixing it and opening a PR is the fastest way. I did not read or understand the whole codebase; I just stepped through it with my debugger and implemented this quick-and-dirty solution.

Just for clarification: when you say "we", do you mean you or me? Clearly you already have the solution in mind, but if you are busy I can also do it.

@FrauElster (Author)

As suggested, I moved the writeSent read/write into writeFrame.
However, I felt bad about failing silently if a double Close occurred, and it is not really a net.ErrClosed since the connection could still be open waiting for the close response, so I introduced a new ErrAlreadyClosed, though currently it is not handled anywhere.
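For illustration, this is roughly how I imagine a caller handling it. Note that ErrAlreadyClosed exists only on this branch and is not part of the released API, so this is a hypothetical sketch:

```go
package example

import (
	"errors"
	"log"

	"github.com/coder/websocket"
)

// closeQuietly shows the caller-side handling I had in mind.
// ErrAlreadyClosed exists only on this PR's branch; it is not part of the
// released coder/websocket API, so this will not compile against a release.
func closeQuietly(conn *websocket.Conn) {
	err := conn.Close(websocket.StatusNormalClosure, "")
	if err == nil || errors.Is(err, websocket.ErrAlreadyClosed) {
		return // closed now, or a close frame was already sent earlier
	}
	log.Printf("close failed: %v", err)
}
```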

@mafredri (Member) left a comment

Thanks for making the changes! I took the liberty of pushing a few myself to your branch; hope you don't mind. (See review comments.)

> Just for clarification: when you say "we", do you mean you or me? Clearly you already have the solution in mind, but if you are busy I can also do it.

My thinking is that I'm happy to let you work on this since you started the PR, and hopefully land a contribution to the project.

Side-note: I typically write/talk in "we"s, as I see coding as a group effort; this way both successes and failures are shared and not the responsibility of any one individual. My hope is that it also sparks inclusivity by making it feel like a group effort instead of one person making decisions.

TBH I think this PR is in pretty good shape now, but we should add at least one test covering this new behavior. Would you want to tackle that?

conn.go (review comment, resolved)
write.go (review comment, resolved)
@mafredri changed the title from "close: Add a lock on writeClose" to "write: Protect against sending messages after close" on Sep 6, 2024
@mitar commented Sep 8, 2024

Very interesting. I came to this PR because I am debugging our tests. In them we measure how many bytes we send over a websocket (using this library for both the client and the server in Go) and how many we receive, for a simple exchange where the client sends the text message hi and the server sends hi back.

With v1.8.10, before closing was rewritten, we had {"wsFromClient":16,"wsToClient":8} bytes transmitted. With v1.8.12, this changed to {"wsFromClient":16,"wsToClient":12}, which I understood to mean the server is now properly closing (hence the increase in wsToClient). But then I noticed it is not always like that: sometimes it is {"wsFromClient":24,"wsToClient":12}, apparently because the client sends close twice. So I found this PR, tried it, and with it I am back to {"wsFromClient":16,"wsToClient":8}, always. Not sure why not {"wsFromClient":16,"wsToClient":12}.
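For reference, one way to collect numbers like these is a thin net.Conn wrapper around the underlying connection. This is a minimal illustrative sketch (type and field names are made up, not our actual test code):

```go
package wscount

import (
	"net"
	"sync/atomic"
)

// countingConn wraps a net.Conn and tallies the bytes flowing in each
// direction, which is how figures like wsFromClient/wsToClient can be
// collected without touching the websocket library itself.
type countingConn struct {
	net.Conn
	read    atomic.Int64
	written atomic.Int64
}

func (c *countingConn) Read(p []byte) (int, error) {
	n, err := c.Conn.Read(p)
	c.read.Add(int64(n))
	return n, err
}

func (c *countingConn) Write(p []byte) (int, error) {
	n, err := c.Conn.Write(p)
	c.written.Add(int64(n))
	return n, err
}
```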

@FrauElster (Author)

Nice to hear that it makes things more predictable. :)

@mafredri Yeah, a test would be great. Sadly I am on a business trip until next Monday, so until then this PR will be stalled, unless you wouldn't mind picking it up.

@FrauElster (Author)

So I had a moment to have a look and wrote several approaches, but I have to admit that I find it kind of hard to test. I am not familiar enough with the code, so I'll describe some assumptions here.

My general idea is to open a server and a client connection, close one, and count the number of close messages received on both ends. Both counts are asserted to be one.

Assumption 1:
The problem is that the exposed methods are high-level and manage the closing handshake themselves without surfacing it to the caller (e.g. conn.Read()), so I cannot count close messages there.

Assumption 2:
A connection ignores any messages received after it was closed. They just get discarded, so I cannot count more than one received close message. (This behaviour differs, e.g., from Chrome's websocket client.)

I am tempted to add a private callback to the connection options, but introducing private dependency injection just to be able to test this seems like overkill and introduces too much complexity for the benefit (IMHO).

A second approach would be to write a separate minimal websocket client/server that can count, but that seems like a lot of work for such a simple test.

I saw that there are some helpers like EchoServer; one could introduce something similar that counts opcodes, but this would also need a separate implementation since it is not in the websocket package.

For now I am using my fork (thanks to go mod replace, that's pretty easy) to fix the problems I am currently experiencing in production.

Looking forward to getting some feedback on the testing issue.
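The closest I got while staying within the public API is a rough sketch like the following. It only checks that a double Close on the client is harmless and cannot actually count close frames on the wire, so names and structure are illustrative only:

```go
package websocket_test

import (
	"context"
	"net/http"
	"net/http/httptest"
	"strings"
	"testing"
	"time"

	"github.com/coder/websocket"
)

// Rough sketch only: it checks that a second Close does not error out or
// hang, but it cannot observe how many close frames actually hit the wire.
func TestDoubleCloseSketch(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		c, err := websocket.Accept(w, r, nil)
		if err != nil {
			return
		}
		// Block until the client's close frame arrives; the library then
		// completes the close handshake internally and Read returns an error.
		_, _, _ = c.Read(r.Context())
	}))
	defer srv.Close()

	c, _, err := websocket.Dial(ctx, "ws"+strings.TrimPrefix(srv.URL, "http"), nil)
	if err != nil {
		t.Fatal(err)
	}

	if err := c.Close(websocket.StatusNormalClosure, ""); err != nil {
		t.Fatalf("first close: %v", err)
	}
	// With this PR the second call should not write another close frame;
	// whether it returns nil or ErrAlreadyClosed is still an open question.
	_ = c.Close(websocket.StatusNormalClosure, "")
}
```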

@FrauElster (Author)

Another note:

With your last commit, my test run (go test -count=10 ./...) gives me "conn_test.go:623: failed to close WebSocket: failed to read frame header: EOF" again (sometimes). So the race condition seems to be back in the current state.

@mafredri (Member) commented Sep 11, 2024

> Another note:
>
> With your last commit, my test run (go test -count=10 ./...) gives me "conn_test.go:623: failed to close WebSocket: failed to read frame header: EOF" again (sometimes). So the race condition seems to be back in the current state.

Thanks @FrauElster, I can reproduce this. I took a look and I'm pretty sure the fix simply surfaced a new issue. Calling (*Conn).CloseRead(ctx) seems to interfere with the close handshake: it hijacks the reader so that the close handshake isn't propagated during the actual close. This in turn closes the connection early, so that (*Conn).Close() runs into io.EOF. (Parts of this could be wrong; I'm still investigating exactly what is happening.)
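For context, the pattern I mean looks roughly like this write-only consumer (illustrative only, not the exact failing test):

```go
package example

import (
	"context"

	"github.com/coder/websocket"
)

// writeOnly sketches the CloseRead pattern: CloseRead starts a goroutine
// that reads and discards incoming frames, which is what appears to get in
// the way of the close handshake during (*Conn).Close().
func writeOnly(ctx context.Context, c *websocket.Conn) error {
	ctx = c.CloseRead(ctx)

	if err := c.Write(ctx, websocket.MessageText, []byte("hi")); err != nil {
		return err
	}
	return c.Close(websocket.StatusNormalClosure, "")
}
```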

I appreciate you taking a look and thinking about ways to test for this. If you want, though, I can take over, as I want to look more closely at the way we close connections.

PS. Are you using CloseRead(ctx) yourself? And have you tried this branch as it is right now as a fix for the problems you're experiencing? If we disregard the failing test, have you seen any indication that this fix doesn't work in your use case?

@FrauElster (Author)

No, I don't use CloseRead(ctx) in my application. I just use the simple

msgType, data, err := conn.Read(ctx)
if websocket.CloseStatus(err) != -1 { ... }

or

conn.Close(websocket.StatusInternalError, string(reason))

@mafredri (Member)

@FrauElster I pushed a fix for the "failed to close WebSocket: failed to read frame header: EOF" issue. Let me know if you want me to open a separate PR since I'm pushing to your fork (I don't mind, but I don't want to cause conflicts for you).

@mitar I'm curious whether c3613fc causes any change in the behavior you're seeing. I have noticed that there are still cases where the writeTimeout and readTimeout contexts aren't cleared, which could potentially cause an abrupt closure of the connection. I'm not saying that's what's happening to you, just a potential cause.

@FrauElster (Author)

Sure, go ahead. My production go.mod is pinned to a commit, so no worries. ;)

Successfully merging this pull request may close these issues:

failed: Close received after close