prov/tcp dead stall - generic ofi_write_socket buffer size issue #5897

bsbernd · 2020-04-30T21:38:33Z

Lets start with the root cause, by default a socket fd has limited number of 1B writes, because of internal linux overhead (I have no idea about BSD and other supported platforms). A small test program writing 1B to a socket can do that only 278 times without setting SO_SNDBUF and default procfs (centos7) settings. Setting SO_SNDBUF increases that to 555, but is still well below default CQ size of 1024 (we have actually increased that to 4096). That can be changed using SO_SNDBUFFORCE or setting /proc/sys/net/core/wmem_max, but that either requires root or CAP_NET_ADMIN permissions and not possible for us, or requires to change generic system settings and is not everywhere wanted. Other providers might be affected as well, for now I see a dead stall using prov/tcp

Our polling thread and which is the socket fd reader/consumer blocks on fastlock_acquire(&ep->lock);

Thread 73 (Thread 0x7ff3fdbc5700 (LWP 2609459)):
#0 pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:33
#1 0x00007ff6968c8a46 in tcpx_try_func (util_ep=util_ep@entry=0x61a00000fc80) at prov/tcp/src/tcpx_progress.c:657
#2 0x00007ff6968c7334 in tcpx_cq_progress (cq=0x61200000ff40) at prov/tcp/src/tcpx_cq.c:58
#3 0x00007ff69686921e in ofi_cq_readfrom (cq_fid=0x61200000ff40, buf=, count=0, src_addr=src_addr@entry=0x0) at prov/util/src/util_cq.c:245
#4 0x00007ff6968694a4 in ofi_cq_read (cq_fid=, buf=, count=) at prov/util/src/util_cq.c:312
#5 0x00007ff69686c6ed in fi_cq_read (count=0, buf=0x0, cq=0x61200000ff40) at ./include/rdma/fi_eq.h:385
#6 util_poll_run (poll_fid=0x607000013880, context=0x7ff3fdbc4828, count=1) at prov/util/src/util_poll.c:95
#7 0x00007ff69686cb97 in fi_poll (count=1, context=0x7ff3fdbc4828, pollset=) at ./include/rdma/fi_eq.h:326
#8 util_wait_fd_try (wait=0x60d000006f90) at prov/util/src/util_wait.c:253
#9 0x00007ff69686cc12 in ofi_trywait (fabric=, fids=0x7ff3fdbc4910, count=2) at prov/util/src/util_wait.c:70
#10 0x0000000000a79f5f in fi_trywait (fabric=0x608000002620, fids=0x7ff3fdbc4910, count=2) at /home/bschubert/local/rhel7/libfabric/include/rdma/fi_eq.h:314

The socket fd writer has acquired that lock and blocks because the socket fd buffer is full.

Thread 11 (Thread 0x7ff67a80b700 (LWP 2609250)):
#0 0x00007ff69538d1fd in write () at ../sysdeps/unix/syscall-template.S:81
#1 0x0000000000480ed1 in __interceptor_write (fd=, ptr=0x7ff67a809b9f, count=) at /scratch/gcc-9.2.0-build/../gcc-9.2.0/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:1087
#2 0x00007ff69686cbde in ofi_write_socket (count=1, buf=0x7ff67a809b9f, fd=) at ./include/unix/osd.h:129
#3 fd_signal_set (signal=0x60d000006fe8) at ./include/ofi_signal.h:91
#4 util_wait_fd_signal (util_wait=0x60d000006f90) at prov/util/src/util_wait.c:231
#5 0x00007ff6968697bf in util_cq_signal (cq=0x61200000ff40) at ./include/ofi_util.h:508
#6 ofi_cq_signal (cq_fid=cq_fid@entry=0x61200000ff40) at prov/util/src/util_cq.c:418
#7 0x00007ff6968c77c3 in tcpx_cq_report_success (cq=0x61200000ff40, xfer_entry=xfer_entry@entry=0x7ff3faae2960) at prov/tcp/src/tcpx_cq.c:166
#8 0x00007ff6968c83d1 in process_tx_entry (tx_entry=tx_entry@entry=0x7ff3faae2960) at prov/tcp/src/tcpx_progress.c:121
#9 0x00007ff6968c8b82 in tcpx_tx_queue_insert (tcpx_ep=tcpx_ep@entry=0x61a00000fc80, tx_entry=tx_entry@entry=0x7ff3faae2960) at prov/tcp/src/tcpx_progress.c:692
#10 0x00007ff6968c5680 in tcpx_send (ep=0x61a00000fc80, buf=0x7ff5fffde478, len=, desc=, dest_addr=, context=0x7ff5fffde000) at prov/tcp/src/tcpx_msg.c:263
#11 0x0000000000a3fdad in fi_send (ep=0x61a00000fc80, buf=0x7ff5fffde478, len=806, desc=0x40, dest_addr=18446744073709551615, context=0x7ff5fffde000) at /home/bschubert/local/rhel7/libfabric/include/rdma/fi_endpoint.h:291

One way to solve that for us might be to set our own wait object (FI_WAIT_SET), which then uses another wakeup way, like eventfd. But as this is a general problem, it should be solved in libfabric as well. Right now I don't have a good idea how, though. Using eventfd in general for fd-signals will not work, as this is linux specific. For prov/tcp it probably would help to do the ofi_cq_signal() outside of the ep lock, but at a minumum that requires some significant changes and I would like to get the opinion of upstream about this issue first.

Thanks,
Bernd

bsbernd · 2020-04-30T21:49:22Z

bernd@t420-work-1 tmp>cat socket-test.c
#include <errno.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

// https://unix.stackexchange.com/questions/424380/what-values-may-linux-use-for-the-default-unix-socket-buffer-size
#define SOCK_OVERHEAD 576
#define CQ_SIZE 8192 // 4096 might be enough

#define NEEDED_SEND_BUF_SIZE (SOCK_OVERHEAD * 8192)

char buf[1024];

enum {
FI_READ_FD,
FI_WRITE_FD
};

int main(int argc, char **argv)
{
int rc;

int sfd[2];
rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) == -1;
if (rc)
{
    perror("error");
    exit(1);
}


int sendbuf;
socklen_t optlen = sizeof(sendbuf);
rc = getsockopt(sfd[FI_WRITE_FD], SOL_SOCKET, SO_SNDBUF, &sendbuf, &optlen);
if (rc)
{
    perror("error");
    exit(1);
}
printf("send-buf-size=%d\n", sendbuf);

sendbuf = NEEDED_SEND_BUF_SIZE;
int option = SO_SNDBUF;

// int option = SO_SNDBUFFORCE;
rc = setsockopt(sfd[FI_WRITE_FD], SOL_SOCKET, option, &sendbuf, optlen);
if (rc)
{
perror("error");
exit(1);
}

rc = getsockopt(sfd[FI_WRITE_FD], SOL_SOCKET, SO_SNDBUF, &sendbuf, &optlen);
if (rc)
{
    perror("error");
    exit(1);
}
printf("newsend-buf-size=%d\n", sendbuf);


int fd = sfd[FI_WRITE_FD];
long count = 0;
while (1)
{
    int rc = write(fd, &buf[0], 1);
    if (rc != 1)
    {
        fprintf(stderr, "Failed to write: %s", strerror(errno));
        exit(1);
    }

    count++;
    
    printf("count = %ld\n", count);
}

return 0;

}

bsbernd · 2020-04-30T21:51:49Z

Default wmem_max

cat /proc/sys/net/core/wmem_max
212992

./socket-test
...
count = 555

shefty · 2020-04-30T22:01:55Z

I'm not following the problem here. Is the ofi_signal code showing an issue? The signal code shouldn't ever have more than 1 byte written to the socket at a time.

bsbernd · 2020-04-30T23:20:25Z

tcpx_cq_report_success() writes 1 byte for every completed cq? So if you don't read from the signal socket sufficiently fast it will block at some point.

shefty · 2020-04-30T23:28:15Z

Are you referring to this call path: tcpx_cq_report_success() -> ofi_cq_signal() -> util_cq_signal() -> cq->wait->signal()?

If so, do you know what that callback is set to?

shefty · 2020-04-30T23:34:30Z

The call trace points to fd_signal_set().

static inline void fd_signal_set(struct fd_signal *signal)
{
	char c = 0;
	if (signal->wcnt == signal->rcnt) {
		if (ofi_write_socket(signal->fd[FI_WRITE_FD], &c, sizeof c) == sizeof c)
			signal->wcnt++;
	}
}

The write socket call is only invoked if the write cnt equals the read cnt. Until someone reads from the socket (calls fd_signal_reset()) we shouldn't write to it again.

The design is to avoid writing to the socket (and taking the kernel transition) unless necessary.

shefty · 2020-04-30T23:40:23Z

Is this limit a process limit, rather than a per socket limit? How many CQs is the application using?

The fd_signal abstraction sets the read fd to nonblocking, but not the write fd. The assumption was that there was no need, since we shouldn't ever write more than 1 byte per socket. We can set the write fd to nonblocking, but a failure here means that we've lost the signal, which could likely cause issues elsewhere.

bsbernd · 2020-05-01T00:10:42Z

Sorry it is getting very very late (well, already early here), I will reply in more detail after getting some sleep. There might be very simple solution - pipe instead of the socket. I need to check why it is writing to the socket the (signal->wcnt == signal->rcnt) condition - in about 8 to 9 hours.
There might be a simple solution, though, if the fd_signal_set() part works correctly - using a pipe instead of a socket. I just did a simple pipe test and that allows to write up to 64K 1B entries.

shefty · 2020-05-01T00:25:16Z

I don't think pipe is available on windows. But worry about it tomorrow.

bsbernd · 2020-05-01T20:21:12Z

Thanks a lot Sean for your help and your pointer to fd_signal_set()! I had seen this function in my traces, but didn't pay attention to it. In the end it turns out is just a missing lock protecting signal->wcnt and signal->rcnt from multi threaded access.

#5900

bsbernd · 2020-05-19T16:42:52Z

Thanks for merging the patch, I'm closing the issue.

bsbernd closed this as completed May 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prov/tcp dead stall - generic ofi_write_socket buffer size issue #5897

prov/tcp dead stall - generic ofi_write_socket buffer size issue #5897

bsbernd commented Apr 30, 2020 •

edited

Loading

bsbernd commented Apr 30, 2020

bsbernd commented Apr 30, 2020 •

edited

Loading

shefty commented Apr 30, 2020

bsbernd commented Apr 30, 2020 •

edited

Loading

shefty commented Apr 30, 2020

shefty commented Apr 30, 2020

shefty commented Apr 30, 2020

bsbernd commented May 1, 2020

shefty commented May 1, 2020

bsbernd commented May 1, 2020

bsbernd commented May 19, 2020

prov/tcp dead stall - generic ofi_write_socket buffer size issue #5897

prov/tcp dead stall - generic ofi_write_socket buffer size issue #5897

Comments

bsbernd commented Apr 30, 2020 • edited Loading

bsbernd commented Apr 30, 2020

bsbernd commented Apr 30, 2020 • edited Loading

shefty commented Apr 30, 2020

bsbernd commented Apr 30, 2020 • edited Loading

shefty commented Apr 30, 2020

shefty commented Apr 30, 2020

shefty commented Apr 30, 2020

bsbernd commented May 1, 2020

shefty commented May 1, 2020

bsbernd commented May 1, 2020

bsbernd commented May 19, 2020

bsbernd commented Apr 30, 2020 •

edited

Loading

bsbernd commented Apr 30, 2020 •

edited

Loading

bsbernd commented Apr 30, 2020 •

edited

Loading