prov/psm3: race causing hangs in fi_multinode test #8090

Closed
aingerson opened this issue Oct 11, 2022 · 1 comment

Describe the bug
psm3 over verbs hangs in roughly one out of every five runs of our CI during the fi_multinode test (3 peers). I can reproduce it by hand, though not consistently. Most often the hang occurs during a barrier: two of the peers have sent all their messages, but one peer gets stuck waiting to receive a message. Below is a backtrace from the peer that is stuck in the receive.

To Reproduce
fi_multinode -p psm3 -C msg -n 3 -s
The command above is run as the server on one node and as 2 clients (same command) on a different node. I am not sure whether this placement is necessary to reproduce the hang; it is just what our CI does.
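
Since the hang is intermittent, a small driver that repeats the run and treats "still running at the deadline" as a hang can help estimate how often it occurs. The following is only a sketch under assumptions that are not part of this report: the helper names `run_once` and `count_hangs` are hypothetical, every process is launched on the local host (the CI setup uses two nodes, so a remote launcher would be needed to mirror it), and the timeout value is an arbitrary choice.

```python
import subprocess
import sys
import time


def run_once(commands, timeout_s):
    """Start every command, then wait up to timeout_s seconds in total.

    Returns True if all processes exited before the deadline and False if
    any process was still running (treated here as a hang).
    """
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for cmd in commands
    ]
    deadline = time.monotonic() + timeout_s
    finished = True
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            finished = False
    if not finished:
        # Clean up stragglers. Skip this step if the goal is to attach a
        # debugger to the stuck process instead.
        for proc in procs:
            if proc.poll() is None:
                proc.kill()
            proc.wait()
    return finished


def count_hangs(commands, runs, timeout_s):
    """Repeat run_once and return how many runs hit the deadline."""
    hangs = 0
    for _ in range(runs):
        if not run_once(commands, timeout_s):
            hangs += 1
    return hangs
```

With `cmd = ["fi_multinode", "-p", "psm3", "-C", "msg", "-n", "3", "-s"]`, a call such as `count_hangs([cmd] * 3, runs=20, timeout_s=120)` would launch the three peers 20 times and report how many runs did not finish. The deadline has to be comfortably longer than a healthy run, otherwise slow runs are miscounted as hangs.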

Output
psm3_verbs_recvhdrq_progress (recvq=0x10addf8) at prov/psm3/psm3/hal_verbs/verbs_recvhdrq.c:189
189 PSMI_CACHEALIGN struct ips_recvhdrq_event rcv_ev = {
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-164.el8.x86_64 libibverbs-56mlnx40-1.56103.x86_64 libnl3-3.5.0-1.el8.x86_64 librdmacm-56mlnx40-1.56103.x86_64 libuuid-2.32.1-28.el8.x86_64 numactl-libs-2.0.12-13.el8.x86_64
(gdb) bt
#0 psm3_verbs_recvhdrq_progress (recvq=0x10addf8) at prov/psm3/psm3/hal_verbs/verbs_recvhdrq.c:189
#1 0x00007f7f31c928fa in psm3_verbs_ips_ptl_poll (ptl_gen=0x10a8300, _ignored=0) at prov/psm3/psm3/hal_verbs/verbs_ptl_ips.c:116
#2 0x00007f7f31c98669 in psm3_poll_internal (ep=0x10a7b40, poll_amsh=1) at prov/psm3/psm3/psm.c:1624
#3 0x00007f7f31cadac6 in psm3_mq_ipeek_dequeue_multi (mq=0x101f250, status_array=0x7ffe725b8b00, status_copy=0x7f7f31c5b65e <psmx3_mq_status_copy>, count=0x7ffe725b8ae4)
at prov/psm3/psm3/psm_mq.c:1154
#4 0x00007f7f31c5d163 in psmx3_cq_poll_mq (cq=0x10244d0, trx_ctxt=0x1022910, event_in=0x7ffe725b8c60, count=0, src_addr=0x0) at prov/psm3/src/psmx3_cq.c:833
#5 0x00007f7f31c5d220 in psmx3_cq_readfrom (cq=0x10244d0, buf=0x7ffe725b8c60, count=1, src_addr=0x0) at prov/psm3/src/psmx3_cq.c:861
#6 0x00007f7f31c5d52a in psmx3_cq_read (cq=0x10244d0, buf=0x7ffe725b8c60, count=1) at prov/psm3/src/psmx3_cq.c:949
#7 0x0000000000404da9 in fi_cq_read (cq=0x10244d0, buf=0x7ffe725b8c60, count=1) at /home/aingerso/install/libfabric/include/rdma/fi_eq.h:394
#8 0x000000000040e533 in ft_spin_for_comp (cq=0x10244d0, cur=0x61be60 <rx_cq_cntr>, total=6, timeout=-1) at common/shared.c:2287
#9 0x000000000040e949 in ft_get_cq_comp (cq=0x10244d0, cur=0x61be60 <rx_cq_cntr>, total=6, timeout=-1) at common/shared.c:2378
#10 0x000000000040ec62 in ft_get_rx_comp (total=6) at common/shared.c:2458
#11 0x0000000000403b7f in send_recv_barrier (sync=0) at multinode/src/core.c:395
#12 0x0000000000403d69 in multi_run_test () at multinode/src/core.c:442
#13 0x00000000004040c3 in multinode_run_tests (argc=9, argv=0x7ffe725b8ee8) at multinode/src/core.c:505
#14 0x0000000000402770 in main (argc=9, argv=0x7ffe725b8ee8) at multinode/src/harness.c:371

Environment:
Linux

aingerson added the bug label Oct 11, 2022
aingerson added a commit to aingerson/libfabric that referenced this issue Oct 14, 2022
psm3 is showing transient failures with the multinode test.
Will re-enable once issue ofiwg#8090 is resolved.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>
aingerson added a commit that referenced this issue Oct 17, 2022
psm3 is showing transient failures with the multinode test.
Will re-enable once issue #8090 is resolved.

Signed-off-by: Alexia Ingerson <alexia.ingerson@intel.com>

github-actions bot commented Oct 7, 2023

This issue is stale because it has been open 360 days with no activity. Remove stale label or comment, otherwise it will be closed in 7 days.
