socket_vmnet gets stuck randomly #39
More information on the issue:
Logs after the VM corresponding to socket 11 is stopped:
On further analysis, it looks like a deadlock problem:
This code has multiple threads writing to a socket at the same time. This will cause problems, potentially corrupting packets. We need to remove the flooding by mapping MAC addresses to socket IDs, so that two threads don't write to a given socket at the same time.
@sheelchand Thanks for the analysis, would you be interested in submitting a PR?
I tried, but the MACs in the packets don't match the VMs' MACs. I tried putting a semaphore before sending to a socket, but that did not help either. I will continue to look at it.
There are 2 race conditions when writing to connections:
- Since we don't hold the semaphore during iteration, a connection can be removed by another thread while we try to use it, which will lead to a use-after-free.
- Multiple threads may try to write to the same connection socket, corrupting the packets (lima-vm#39).

Both issues are fixed by holding the semaphore during iteration and writing to the socket. This is not the most efficient way, but socket_vmnet crashes daily and we must stop the bleeding first. We can add more fine-grained locking later.
@sheelchand Removing flooding is important for performance, but it will not solve the issue of writing to the same socket at the same time from different threads. Example flow when we send each packet only to the destination:
writev(2) does not mention anything about thread safety or about a message size that can be written atomically, so we should assume it is unsafe. send(2) seems safe:
But using send we would have to copy the packet in order to do one syscall, so writev seems the better way, especially if we can find a way to send multiple packets per syscall instead of one, without sending all packets to all the guests.
We have 7 QEMU VMs running, each with 3 virtual ethernet interfaces.
socket_vmnet works most of the time but randomly stops working, and communication between the VMs stops.
The debug logs show the process gets stuck on a writev() call. There is no log after this line:

```
DEBUG| [Socket-to-Socket i=1815762] Sending from socket 8 to socket 5: 4 + 95 bytes
```
On VM reboot, the logs show that the writev() call returned -1.
I suspect this is due to a race condition when multiple threads access the socket to send and receive data. I don't have the exact explanation yet, but the behavior points to a race condition.