send msg to a dead process got "-FI_EAGAIN" for socket provider #3007

Closed
liuxuezhao opened this issue May 26, 2017 · 0 comments

@liuxuezhao
Contributor

We found a problem when sending a message (by calling fi_tsend()) to a failed or killed target process over the sockets provider.
The caller hangs in that case: when the connection fails, fi_tsend() returns “-FI_EAGAIN”, and our code simply retries whenever it sees EAGAIN, which results in an infinite loop.
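
The caller-side half of this problem can be sketched as a retry loop with an upper bound. This is only an illustration under assumptions: `send_fn`, `send_with_retry_limit`, and the two stubs are hypothetical names, the send call is abstracted behind a function pointer instead of calling fi_tsend() directly, and plain `-EAGAIN` stands in for `-FI_EAGAIN`.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a send call such as fi_tsend(): returns 0 on
 * success or a negative errno-style code. Not a libfabric type. */
typedef int (*send_fn)(void *arg);

/* Retry only while the call reports -EAGAIN, and give up after max_attempts
 * so that an unreachable peer cannot trap the caller in an endless loop. */
static int send_with_retry_limit(send_fn try_send, void *arg, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int rc = try_send(arg);
        if (rc != -EAGAIN)
            return rc;      /* success or a non-retryable error */
    }
    return -ETIMEDOUT;      /* retry budget exhausted */
}

/* Stub that always reports -EAGAIN and counts how often it was called. */
static int always_eagain(void *arg)
{
    int *calls = arg;
    (*calls)++;
    return -EAGAIN;
}

/* Stub that reports -EAGAIN twice and succeeds on the third call. */
static int succeeds_on_third(void *arg)
{
    int *calls = arg;
    (*calls)++;
    return (*calls >= 3) ? 0 : -EAGAIN;
}
```

With a bound like this the application eventually sees a failure even if the provider keeps returning EAGAIN, although it still cannot tell a busy peer from a dead one.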

The back trace is:
#0 0x00007f1a0e343efd in nanosleep () from /lib64/libc.so.6
#1 0x00007f1a0e343d94 in sleep () from /lib64/libc.so.6
#2 0x00007f1a0cc34df7 in sock_ep_connect (ep_attr=0x18e9830, index=2) at prov/sockets/src/sock_conn.c:496
#3 0x00007f1a0cc21e4d in sock_ep_get_conn (attr=0x18e9830, tx_ctx=0x18ea080, index=2, pconn=0x7ffe324c4688) at prov/sockets/src/sock_ep.c:1819
#4 0x00007f1a0cc3669b in sock_ep_tsendmsg (ep=0x18e9740, msg=0x7ffe324c4740, flags=2305843009213693952) at prov/sockets/src/sock_msg.c:558
#5 0x00007f1a0cc36a2f in sock_ep_tsend (ep=0x18e9740, buf=0x194d000, len=170, desc=0x194a340, dest_addr=2, tag=4, context=0x194c8c0) at prov/sockets/src/sock_msg.c:646
#6 0x00007f1a0ea60fca in fi_tsend (context=0x194c8c0, tag=4, dest_addr=, desc=0x194a340, len=170, buf=0x194d000, ep=0x18e9740)
at /home/xliu9/src/daos_m/install/include/rdma/fi_tagged.h:116

sock_ep_connect() retries 5 times (sleeping 10 seconds each time) and then returns NULL to sock_ep_get_conn(), which in turn returns “-FI_EAGAIN” to the user (because errno == EINPROGRESS).

Two questions related to this issue:

  1. When the target is already dead or unreachable, could the send() API return a more appropriate error code than “-FI_EAGAIN”? EAGAIN seems to mean that the user is free to simply retry.
  2. There is a sleep(10) in sock_ep_connect():

     retry:
         do_retry--;
         sleep(10);
         if (!do_retry)
             goto err;
Would it be possible to remove that long sleep()? It causes a huge delay when the user tries to connect to an unreachable peer.
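
Both questions can be illustrated with one small sketch: a connect loop that returns permanent errors immediately under their own code and retries transient ones with a short, capped backoff instead of a fixed sleep(10). This is not the sockets provider's code. `connect_fn`, `sleep_ms_fn`, `connect_with_backoff`, `struct probe`, and the stubs are hypothetical names, and the split between transient and permanent errno values is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical hook for one connection attempt: returns 0 on success or a
 * positive errno value on failure. Not part of the sockets provider. */
typedef int (*connect_fn)(void *arg);

/* Hypothetical hook that waits for the given number of milliseconds. */
typedef void (*sleep_ms_fn)(long ms, void *arg);

/* Errors that may clear up on their own; everything else is treated as
 * permanent. The exact split is an assumption for illustration. */
static int is_transient(int err)
{
    return err == EINPROGRESS || err == EAGAIN || err == EINTR;
}

/* Try to connect up to max_attempts times. Permanent errors are returned
 * immediately as -errno; transient ones are retried after a delay that
 * starts at first_delay_ms and doubles up to max_delay_ms. */
static int connect_with_backoff(connect_fn try_connect, sleep_ms_fn sleep_ms,
                                void *arg, int max_attempts,
                                long first_delay_ms, long max_delay_ms)
{
    long delay = first_delay_ms;

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int err = try_connect(arg);
        if (err == 0)
            return 0;
        if (!is_transient(err))
            return -err;        /* e.g. -ECONNREFUSED, -EHOSTUNREACH */
        if (attempt + 1 < max_attempts) {
            sleep_ms(delay, arg);
            delay = (delay * 2 < max_delay_ms) ? delay * 2 : max_delay_ms;
        }
    }
    return -ETIMEDOUT;          /* still unreachable after all attempts */
}

/* Stubs for trying the function without a network. */
struct probe { int calls; long slept_ms; int err; };

static int fake_connect(void *arg)
{
    struct probe *p = arg;
    p->calls++;
    return p->err;
}

static void fake_sleep(long ms, void *arg)
{
    struct probe *p = arg;
    p->slept_ms += ms;
}
```

A real caller would pass a wrapper around an actual connect attempt and around something like nanosleep(). The point of the design is that the caller only sees a retryable code while retrying can still help, and the worst-case wait is set by the arguments instead of being fixed at 5 x 10 seconds.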

@shefty shefty closed this as completed Aug 24, 2020