Deadlock (?) between garbage collection and RcvQueue worker thread termination #83

Open
jrudolph opened this issue May 24, 2016 · 1 comment

Comments

@jrudolph

We observe a situation where UDT completely hangs with many threads stuck waiting for the m_ControlLock.

At this point the lock is held by the garbage collection thread (in checkBrokenSockets), which is waiting for an RcvQueue worker thread to terminate:

(gdb) bt
#0  0x00007f5b9f593ef7 in pthread_join (threadid=140028744247040, thread_return=0x0) at pthread_join.c:92
#1  0x00007f5b5c3b6221 in CRcvQueue::~CRcvQueue() () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#2  0x00007f5b5c39b0bd in CUDTUnited::removeSocket(int) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#3  0x00007f5b5c39baa2 in CUDTUnited::checkBrokenSockets() () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#4  0x00007f5b5c39bc64 in CUDTUnited::garbageCollect(void*) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#5  0x00007f5b9f592dc5 in start_thread (arg=0x7f5b17fff700) at pthread_create.c:308
#6  0x00007f5b9eea628d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) frame 0
#0  0x00007f5b9f593ef7 in pthread_join (threadid=140028744247040, thread_return=0x0) at pthread_join.c:92
92      lll_wait_tid (pd->tid);
(gdb) print pd->tid
$3 = 17122

The worker thread seems to be stuck in recvmsg:

Thread 7 (Thread 0x7f5afb8f2700 (LWP 17122)):
#0  0x00007f5b9f59967d in recvmsg () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f5b5c3a0b2b in CChannel::recvfrom(sockaddr*, CPacket&) const () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#2  0x00007f5b5c3b6fee in CRcvQueue::worker(void*) () from /tmp/udt_jndi_lib/lib/amd64-Linux-gpp/jni/libbarchart-udt-core-2.3.0-SNAPSHOT.so
#3  0x00007f5b9f592dc5 in start_thread (arg=0x7f5afb8f2700) at pthread_create.c:308
#4  0x00007f5b9eea628d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

This doesn't look like a classical deadlock; it seems to be more a problem with the blocking recvmsg call.

Does anyone have an idea how this could happen?

@jrudolph

I suspect that the problem is related to the one described here: https://sourceforge.net/p/udt/discussion/393036/thread/d95e119f/?limit=25#1c43

> By performing the close() before deleting the queues, doesn't this allow for the possibility that between the close() and the queue deletion, a new socket using the old file descriptor could be created in another thread and one or both of the queues could improperly use that new file descriptor? I did not see any synchronization which would prevent this problem. Would moving the channel close() to happen after the queues have been deleted introduce other problems?

In my case, however, the file descriptor is reused not by UDT but by another part of the application, which opens a completely unrelated TCP socket that gets the same file descriptor number. This new socket is perfectly valid, so the worker's recvmsg call blocks on it indefinitely, bringing UDT to a complete halt.
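To illustrate why a stale descriptor number can silently end up referring to an unrelated socket, here is a minimal sketch (not UDT code; the function name `descriptor_number_is_reused` is made up for this example). It assumes a single-threaded process on Linux, where a newly created socket receives the lowest free descriptor number:

```cpp
#include <sys/socket.h>
#include <unistd.h>
#include <cassert>

// Hypothetical helper, not part of UDT: returns true if a TCP socket opened
// right after closing a UDP socket receives the same descriptor number.
// Assumes no other thread opens or closes descriptors in between.
bool descriptor_number_is_reused() {
    int udp_fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    assert(udp_fd >= 0 && "socket() failed");

    const int stale_fd = udp_fd;   // what a worker thread might still hold
    ::close(udp_fd);               // the number is now free for reuse

    int tcp_fd = ::socket(AF_INET, SOCK_STREAM, 0);
    assert(tcp_fd >= 0 && "socket() failed");

    // Any read on stale_fd would now hit the new TCP socket.
    const bool reused = (tcp_fd == stale_fd);
    ::close(tcp_fd);
    return reused;
}
```

A thread that still calls recvmsg on `stale_fd` at this point would operate on the new TCP socket, which matches the situation in the backtrace above.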

jrudolph added a commit to jrudolph/barchart-udt that referenced this issue May 27, 2016
…chart#83

Otherwise, the socket is closed and the file descriptor is freed prematurely.
Due to a race condition, the queues might still be trying to use the old
file descriptor, which by now may already refer to another, unrelated socket.
This may lead to "stolen" data, i.e. data that is read accidentally by the
queue worker of the already closed socket. Or it may lead to a complete
deadlock if the file descriptor now refers to a blocking socket: in that case
`delete m->second.m_pRcvQueue` never returns, because it joins the worker
thread, which blocks indefinitely on the wrong socket. Eventually this
deadlock brings UDT to a complete halt, because the above code holds the
m_ControlLock, which all other work sooner or later tries to acquire.
jrudolph added a commit to RBMHTechnology/barchart-udt that referenced this issue Jun 1, 2016
…chart#83


(cherry picked from commit ccb843e)
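The teardown order described in the commit message (stop and join the worker first, close the descriptor last) might look roughly like the sketch below. This is not the actual patch: `Receiver` and `run_and_stop` are hypothetical stand-ins for the queue/channel pair, and the sketch assumes the worker polls a closing flag between receives that time out via SO_RCVTIMEO.

```cpp
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a receive queue and its channel; not UDT's classes.
class Receiver {
public:
    Receiver() {
        fd_ = ::socket(AF_INET, SOCK_DGRAM, 0);
        timeval tv;
        tv.tv_sec = 0;
        tv.tv_usec = 100000;  // 100 ms receive timeout (assumed value)
        ::setsockopt(fd_, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
        worker_ = std::thread([this] { run(); });
    }

    ~Receiver() {
        closing_.store(true);  // 1. ask the worker to stop
        worker_.join();        // 2. wait until it no longer touches fd_
        ::close(fd_);          // 3. only now release the descriptor number
    }

    int iterations() const { return iterations_.load(); }

private:
    void run() {
        char buf[1500];
        while (!closing_.load()) {
            // Returns after at most the timeout, so the flag is re-checked.
            ::recv(fd_, buf, sizeof(buf), 0);
            ++iterations_;
        }
    }

    int fd_ = -1;
    std::atomic<bool> closing_{false};
    std::atomic<int> iterations_{0};
    std::thread worker_;  // declared last so the other members exist first
};

// Runs a receiver for the given time, tears it down, and reports how many
// receive attempts the worker made.
int run_and_stop(int milliseconds) {
    Receiver r;
    std::this_thread::sleep_for(std::chrono::milliseconds(milliseconds));
    return r.iterations();
}
```

Because the descriptor stays open until after the join, the number cannot be handed to an unrelated socket while the worker is still reading from it. Swapping steps 2 and 3 would reintroduce the race described in this issue.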