send msg to a dead process got "-FI_EAGAIN" for socket provider #3007

liuxuezhao · 2017-05-26T01:54:45Z

We hit a problem when sending a message (by calling fi_tsend) to a failed or killed target process over the sockets provider.
The sender hangs: when the connection fails, fi_tsend() returns "-FI_EAGAIN", and our code simply retries whenever it sees EAGAIN, which results in an infinite loop.
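As a caller-side workaround for the hang described above, the retry can be given a budget. The following is only a minimal sketch, not our actual code: `send_attempt_fn`, `send_with_retry_budget` and the two stubs are hypothetical names, the callback stands in for a wrapper around fi_tsend(), and plain EAGAIN from errno.h is used as a stand-in for FI_EAGAIN (as far as I know libfabric defines its error codes in terms of the errno values). Returning -ETIMEDOUT once the budget is spent is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: performs one send attempt and returns 0 on
 * success or a negative errno-style code (e.g. a wrapper around fi_tsend). */
typedef int (*send_attempt_fn)(void *arg);

/* Retry only while the attempt reports -EAGAIN, and give up with
 * -ETIMEDOUT after max_attempts tries instead of looping forever. */
int send_with_retry_budget(send_attempt_fn attempt, void *arg, int max_attempts)
{
    assert(attempt != NULL && max_attempts > 0);

    for (int i = 0; i < max_attempts; i++) {
        int rc = attempt(arg);
        if (rc != -EAGAIN)
            return rc;  /* success, or a definitive error */
        /* A real caller would normally drive progress here (for example
         * by polling its completion queue) before trying again. */
    }
    return -ETIMEDOUT;
}

/* Stub standing in for a send to a dead peer: always -EAGAIN. */
int always_eagain(void *arg)
{
    int *calls = arg;
    (*calls)++;
    return -EAGAIN;
}

/* Stub standing in for a peer that becomes reachable on the third try. */
int succeeds_on_third(void *arg)
{
    int *calls = arg;
    (*calls)++;
    return (*calls < 3) ? -EAGAIN : 0;
}
```

With this shape the caller still retries transient EAGAIN results, but a dead target turns into an error after a bounded number of attempts rather than a hang.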

The backtrace is:
#0 0x00007f1a0e343efd in nanosleep () from /lib64/libc.so.6
#1 0x00007f1a0e343d94 in sleep () from /lib64/libc.so.6
#2 0x00007f1a0cc34df7 in sock_ep_connect (ep_attr=0x18e9830, index=2) at prov/sockets/src/sock_conn.c:496
#3 0x00007f1a0cc21e4d in sock_ep_get_conn (attr=0x18e9830, tx_ctx=0x18ea080, index=2, pconn=0x7ffe324c4688) at prov/sockets/src/sock_ep.c:1819
#4 0x00007f1a0cc3669b in sock_ep_tsendmsg (ep=0x18e9740, msg=0x7ffe324c4740, flags=2305843009213693952) at prov/sockets/src/sock_msg.c:558
#5 0x00007f1a0cc36a2f in sock_ep_tsend (ep=0x18e9740, buf=0x194d000, len=170, desc=0x194a340, dest_addr=2, tag=4, context=0x194c8c0) at prov/sockets/src/sock_msg.c:646
#6 0x00007f1a0ea60fca in fi_tsend (context=0x194c8c0, tag=4, dest_addr=, desc=0x194a340, len=170, buf=0x194d000, ep=0x18e9740)
at /home/xliu9/src/daos_m/install/include/rdma/fi_tagged.h:116

sock_ep_connect retries 5 times (sleeping 10 seconds each time) and then returns NULL to sock_ep_get_conn(), which in turn returns "-FI_EAGAIN" to the user (because errno == EINPROGRESS).

Two questions related to this issue:

1. When the target is already dead or unreachable, could the send API return a more appropriate error code than "-FI_EAGAIN"? EAGAIN seems to mean that the user is free to simply retry.
2. There is a sleep(10) in sock_ep_connect:

    retry:
        do_retry--;
        sleep(10);
        if (!do_retry)
            goto err;

   Could this be refined to remove the long sleep()? It causes a huge delay when the user tries to connect to a peer that is not reachable.
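To make both questions concrete, here is a rough sketch of the behaviour I have in mind. It is not the provider's code and not a proposed patch: `connect_attempt_fn`, `connect_with_backoff` and the delay values are hypothetical. The idea is to retry with a short, doubling delay instead of a fixed sleep(10), and to report a definitive error when the attempts run out, mapping EINPROGRESS/EAGAIN to -ETIMEDOUT so the caller is not told to retry forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: one connection attempt, returning 0 on
 * success or -1 with errno set on failure. */
typedef int (*connect_attempt_fn)(void *arg);

/* Try to connect up to max_attempts times, sleeping first_delay_ms
 * after the first failure and doubling the delay after each later one.
 * Returns 0 on success or a negative errno value on failure. */
int connect_with_backoff(connect_attempt_fn attempt, void *arg,
                         int max_attempts, long first_delay_ms)
{
    long delay_ms = first_delay_ms;
    int err = ETIMEDOUT;

    for (int i = 0; i < max_attempts; i++) {
        if (attempt(arg) == 0)
            return 0;
        err = errno;  /* remember why the last attempt failed */

        if (i + 1 < max_attempts) {
            struct timespec ts = {
                .tv_sec = delay_ms / 1000,
                .tv_nsec = (delay_ms % 1000) * 1000000L,
            };
            nanosleep(&ts, NULL);
            delay_ms *= 2;
        }
    }

    /* Do not hand a "try again" style code back to the caller once the
     * retry budget is exhausted. */
    if (err == EINPROGRESS || err == EAGAIN)
        err = ETIMEDOUT;
    return -err;
}

/* Stub: peer refuses every connection attempt. */
int refuse_always(void *arg)
{
    int *calls = arg;
    (*calls)++;
    errno = ECONNREFUSED;
    return -1;
}

/* Stub: connection never completes (errno stays EINPROGRESS). */
int in_progress_always(void *arg)
{
    int *calls = arg;
    (*calls)++;
    errno = EINPROGRESS;
    return -1;
}
```

For example, with 3 attempts and a 1 ms first delay, an unreachable peer is reported after roughly 3 ms of sleeping in total, compared with tens of seconds for the fixed sleep(10) loop.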

shefty closed this as completed Aug 24, 2020
