prov/efa: Issues when closing multiple RDM endpoints #11268

@k0zmo

Description

I have an application that creates multiple RDM endpoints, some send-only and some receive-only. Send-only endpoints use different domain(s) than receive-only endpoints. Each endpoint transmits or receives approximately 2.2 Gbps to or from a remote instance of the same application. When I stop one instance, the other sees that the remote endpoints are down and closes its own endpoints. After a relatively short wait, the endpoints are recreated and we try to re-establish the control connection (TCP). Each endpoint is managed by a separate thread. The problem is that the application can behave in three different ways:

  • It keeps creating and releasing RDM endpoints without any observed issues,
  • It crashes with a segfault,
  • One of the threads that manages an RDM endpoint spins a CPU core at 100%.

For the 2nd scenario, the backtrace is as follows:

#0  0x00007516fc14bf4d in efa_rdm_txe_handle_error () from /home/ubuntu/framecache/libfabric/libefa-fi.so
#1  0x00007516fc14a00d in efa_rdm_pke_handle_tx_error () from /home/ubuntu/framecache/libfabric/libefa-fi.so
#2  0x00007516fc136590 in efa_rdm_cq_poll_ibv_cq () from /home/ubuntu/framecache/libfabric/libefa-fi.so
#3  0x00007516fc13a496 in efa_rdm_ep_close () from /home/ubuntu/framecache/libfabric/libefa-fi.so
#4  0x00005d258760b688 in fi_close (fid=<optimized out>) at /opt/grassvalley/dev/libfabric_2.2.0_linux/include/rdma/fabric.h:644

I ran this with AddressSanitizer enabled, which gave me the following output:

=================================================================
==44938==ERROR: AddressSanitizer: heap-use-after-free on address 0x7e4df2085dd0 at pc 0x7e4f7d8915ee bp 0x7e4dea3dffd0 sp 0x7e4dea3dffc0
WRITE of size 8 at 0x7e4df2085dd0 thread T87 (20000000-0000-0)
    #0 0x7e4f7d8915ed in dlist_remove include/ofi_list.h:97
    #1 0x7e4f7d8915ed in efa_rdm_txe_handle_error prov/efa/src/rdm/efa_rdm_ope.c:711
    #2 0x7e4f7d88b900 in efa_rdm_pke_handle_tx_error prov/efa/src/rdm/efa_rdm_pke_cmd.c:512
    #3 0x7e4f7d847629 in efa_rdm_cq_poll_ibv_cq prov/efa/src/rdm/efa_rdm_cq.c:614
    #4 0x7e4f7d855d8e in efa_rdm_ep_wait_send prov/efa/src/rdm/efa_rdm_ep_fiops.c:842
    #5 0x7e4f7d855d8e in efa_rdm_ep_close prov/efa/src/rdm/efa_rdm_ep_fiops.c:929
    #6 0x6352bb2826d7 in fi_close /opt/grassvalley/dev/libfabric_2.2.0_linux/include/rdma/fabric.h:644
    #7 0x6352bb2826d7 in Miranda::Mocha::FrameCache::FabricInterfaceDeleter<fid_ep>::operator()(fid_ep*) /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/RdmaUtils.h:29
    #8 0x6352bb2826d7 in std::_Sp_counted_deleter<fid_ep*, Miranda::Mocha::FrameCache::FabricInterfaceDeleter<fid_ep>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /usr/include/c++/10/bits/shared_ptr_base.h:474
    #9 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:158
    #10 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:151
    #11 0x6352bb2bf35f in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/10/bits/shared_ptr_base.h:736
    #12 0x6352bb2bf35f in std::__shared_ptr<fid_ep, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/10/bits/shared_ptr_base.h:1188
    #13 0x6352bb2bf35f in std::shared_ptr<fid_ep>::~shared_ptr() /usr/include/c++/10/bits/shared_ptr.h:121
    #14 0x6352bb2bf35f in Miranda::Mocha::FrameCache::EfaEndpoint::~EfaEndpoint() /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/EfaEndpoint.h:24
    #15 0x6352bb2bf35f in void __gnu_cxx::new_allocator<Miranda::Mocha::FrameCache::EfaEndpoint>::destroy<Miranda::Mocha::FrameCache::EfaEndpoint>(Miranda::Mocha::FrameCache::EfaEndpoint*) /usr/include/c++/10/ext/new_allocator.h:162
    #16 0x6352bb2bf35f in void std::allocator_traits<std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint> >::destroy<Miranda::Mocha::FrameCache::EfaEndpoint>(std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint>&, Miranda::Mocha::FrameCache::EfaEndpoint*) /usr/include/c++/10/bits/alloc_traits.h:531
    #17 0x6352bb2bf35f in std::_Sp_counted_ptr_inplace<Miranda::Mocha::FrameCache::EfaEndpoint, std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /usr/include/c++/10/bits/shared_ptr_base.h:560
    #18 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:158
    #19 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:151
    #20 0x6352bb2ec79e in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/10/bits/shared_ptr_base.h:736
    #21 0x6352bb2ec79e in std::__shared_ptr<Miranda::Mocha::FrameCache::EfaEndpoint, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/10/bits/shared_ptr_base.h:1188
    #22 0x6352bb2ec79e in std::__shared_ptr<Miranda::Mocha::FrameCache::EfaEndpoint, (__gnu_cxx::_Lock_policy)2>::reset() /usr/include/c++/10/bits/shared_ptr_base.h:1306
    #23 0x6352bb2ec79e in Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc()::{lambda()#1}::operator()() /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:292
    #24 0x6352bb2ec79e in boost::detail::function::void_function_obj_invoker0<Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc()::{lambda()#1}, void>::invoke(boost::detail::function::function_buffer&) /opt/grassvalley/dev/boost_1_74_0/boost/function/function_template.hpp:158
    #25 0x6352bb2eef9b in boost::function0<void>::operator()() const /opt/grassvalley/dev/boost_1_74_0/boost/function/function_template.hpp:763
    #26 0x6352bb2eef9b in boost::scope_exit::aux::guard<void>::~guard() /opt/grassvalley/dev/boost_1_74_0/boost/scope_exit.hpp:714
    #27 0x6352bb2eef9b in Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc() /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:287
    #28 0x7e4fa79f6a21 in Miranda::Vy::Thread::threadEntryFunc(std::shared_ptr<Miranda::Vy::Thread::Context>&) (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x7f6a21) (BuildId: f1b5f579ad00070a)
    #29 0x7e4fa79f78eb  (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x7f78eb) (BuildId: f1b5f579ad00070a)
    #30 0x7e4fa7a753c4  (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x8753c4) (BuildId: f1b5f579ad00070a)
    #31 0x7e4faac5ea41 in asan_thread_start ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234
    #32 0x7e4fa6a9caa3  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x9caa3) (BuildId: 282c2c16e7b6600b0b22ea0c99010d2795752b5f)
    #33 0x7e4fa6b29c3b  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x129c3b) (BuildId: 282c2c16e7b6600b0b22ea0c99010d2795752b5f)

0x7e4df2085dd0 is located 1488 bytes inside of 11010944-byte region [0x7e4df2085800,0x7e4df2b05b80)
freed by thread T86 (20000000-0000-0) here:
    #0 0x7e4faacfc4d8 in free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:52
    #1 0x7e4f7d91136d in ofi_freealign include/unix/osd.h:122
    #2 0x7e4f7d91136d in ofi_bufpool_region_free prov/util/src/util_buf.c:130
    #3 0x7e4f7d91136d in ofi_bufpool_region_free prov/util/src/util_buf.c:116
    #4 0x7e4f7d91136d in ofi_bufpool_destroy prov/util/src/util_buf.c:295
    #5 0x7e4f7d855407 in efa_rdm_ep_destroy_buffer_pools prov/efa/src/rdm/efa_rdm_ep_fiops.c:751
    #6 0x7e4f7d855407 in efa_rdm_ep_close prov/efa/src/rdm/efa_rdm_ep_fiops.c:999
    #7 0x6352bb2826d7 in fi_close /opt/grassvalley/dev/libfabric_2.2.0_linux/include/rdma/fabric.h:644
    #8 0x6352bb2826d7 in Miranda::Mocha::FrameCache::FabricInterfaceDeleter<fid_ep>::operator()(fid_ep*) /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/RdmaUtils.h:29
    #9 0x6352bb2826d7 in std::_Sp_counted_deleter<fid_ep*, Miranda::Mocha::FrameCache::FabricInterfaceDeleter<fid_ep>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /usr/include/c++/10/bits/shared_ptr_base.h:474
    #10 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:158
    #11 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:151
    #12 0x6352bb2bf35f in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/10/bits/shared_ptr_base.h:736
    #13 0x6352bb2bf35f in std::__shared_ptr<fid_ep, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/10/bits/shared_ptr_base.h:1188
    #14 0x6352bb2bf35f in std::shared_ptr<fid_ep>::~shared_ptr() /usr/include/c++/10/bits/shared_ptr.h:121
    #15 0x6352bb2bf35f in Miranda::Mocha::FrameCache::EfaEndpoint::~EfaEndpoint() /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/EfaEndpoint.h:24
    #16 0x6352bb2bf35f in void __gnu_cxx::new_allocator<Miranda::Mocha::FrameCache::EfaEndpoint>::destroy<Miranda::Mocha::FrameCache::EfaEndpoint>(Miranda::Mocha::FrameCache::EfaEndpoint*) /usr/include/c++/10/ext/new_allocator.h:162
    #17 0x6352bb2bf35f in void std::allocator_traits<std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint> >::destroy<Miranda::Mocha::FrameCache::EfaEndpoint>(std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint>&, Miranda::Mocha::FrameCache::EfaEndpoint*) /usr/include/c++/10/bits/alloc_traits.h:531
    #18 0x6352bb2bf35f in std::_Sp_counted_ptr_inplace<Miranda::Mocha::FrameCache::EfaEndpoint, std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() /usr/include/c++/10/bits/shared_ptr_base.h:560
    #19 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:158
    #20 0x6352bb1ba659 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() /usr/include/c++/10/bits/shared_ptr_base.h:151
    #21 0x6352bb2ec79e in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count() /usr/include/c++/10/bits/shared_ptr_base.h:736
    #22 0x6352bb2ec79e in std::__shared_ptr<Miranda::Mocha::FrameCache::EfaEndpoint, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /usr/include/c++/10/bits/shared_ptr_base.h:1188
    #23 0x6352bb2ec79e in std::__shared_ptr<Miranda::Mocha::FrameCache::EfaEndpoint, (__gnu_cxx::_Lock_policy)2>::reset() /usr/include/c++/10/bits/shared_ptr_base.h:1306
    #24 0x6352bb2ec79e in Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc()::{lambda()#1}::operator()() /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:292
    #25 0x6352bb2ec79e in boost::detail::function::void_function_obj_invoker0<Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc()::{lambda()#1}, void>::invoke(boost::detail::function::function_buffer&) /opt/grassvalley/dev/boost_1_74_0/boost/function/function_template.hpp:158
    #26 0x6352bb2eef9b in boost::function0<void>::operator()() const /opt/grassvalley/dev/boost_1_74_0/boost/function/function_template.hpp:763
    #27 0x6352bb2eef9b in boost::scope_exit::aux::guard<void>::~guard() /opt/grassvalley/dev/boost_1_74_0/boost/scope_exit.hpp:714
    #28 0x6352bb2eef9b in Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc() /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:287
    #29 0x7e4fa79f6a21 in Miranda::Vy::Thread::threadEntryFunc(std::shared_ptr<Miranda::Vy::Thread::Context>&) (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x7f6a21) (BuildId: f1b5f579ad00070a)
    #30 0x7e4fa79f78eb  (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x7f78eb) (BuildId: f1b5f579ad00070a)
    #31 0x7e4fa7a753c4  (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x8753c4) (BuildId: f1b5f579ad00070a)
    #32 0x7e4faac5ea41 in asan_thread_start ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234
    #33 0x7e4fa6a9caa3  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x9caa3) (BuildId: 282c2c16e7b6600b0b22ea0c99010d2795752b5f)
    #34 0x7e4fa6b29c3b  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x129c3b) (BuildId: 282c2c16e7b6600b0b22ea0c99010d2795752b5f)

previously allocated by thread T86 (20000000-0000-0) here:
    #0 0x7e4faacfcf1d in posix_memalign ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
    #1 0x7e4f7d91027b in ofi_memalign include/unix/osd.h:117
    #2 0x7e4f7d91027b in ofi_bufpool_region_alloc prov/util/src/util_buf.c:111
    #3 0x7e4f7d91027b in ofi_bufpool_grow prov/util/src/util_buf.c:161
    #4 0x7e4f7d84eae7 in ofi_buf_alloc include/ofi_mem.h:526
    #5 0x7e4f7d84eae7 in efa_rdm_ep_alloc_txe prov/efa/src/rdm/efa_rdm_ep_utils.c:332
    #6 0x7e4f7d86a54d in efa_rdm_msg_generic_send prov/efa/src/rdm/efa_rdm_msg.c:181
    #7 0x7e4f7d86a54d in efa_rdm_msg_sendv prov/efa/src/rdm/efa_rdm_msg.c:280
    #8 0x6352bb2ebbdc in fi_sendv /opt/grassvalley/dev/libfabric_2.2.0_linux/include/rdma/fi_endpoint.h:334
    #9 0x6352bb2ebbdc in Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::handleEfaXfers(Miranda::Mocha::FrameCache::RdmaSge const&, std::shared_ptr<Miranda::Mocha::FrameCache::EfaPoller> const&) /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:465
    #10 0x6352bb2ee283 in Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc() /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:427
    #11 0x7e4fa79f6a21 in Miranda::Vy::Thread::threadEntryFunc(std::shared_ptr<Miranda::Vy::Thread::Context>&) (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x7f6a21) (BuildId: f1b5f579ad00070a)
    #12 0x7e4fa79f78eb  (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x7f78eb) (BuildId: f1b5f579ad00070a)
    #13 0x7e4fa7a753c4  (/home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0+0x8753c4) (BuildId: f1b5f579ad00070a)
    #14 0x7e4faac5ea41 in asan_thread_start ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234
    #15 0x7e4fa6a9caa3  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x9caa3) (BuildId: 282c2c16e7b6600b0b22ea0c99010d2795752b5f)
    #16 0x7e4fa6b29c3b  (/usr/lib/x86_64-linux-gnu/libc.so.6+0x129c3b) (BuildId: 282c2c16e7b6600b0b22ea0c99010d2795752b5f)

SUMMARY: AddressSanitizer: heap-use-after-free include/ofi_list.h:97 in dlist_remove
Shadow bytes around the buggy address:
  0x7e4df2085b00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2085b80: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2085c00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2085c80: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2085d00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
=>0x7e4df2085d80: fd fd fd fd fd fd fd fd fd fd[fd]fd fd fd fd fd
  0x7e4df2085e00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2085e80: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2085f00: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2085f80: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x7e4df2086000: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==44938==ABORTING

I'm not sure how to interpret the results. It looks like the bufpool was destroyed when one endpoint was closed, and was then written to when another endpoint was closed and its queued operations were flushed. I briefly checked, and the bufpools are per-endpoint, so I'm not sure how they could become intertwined like that.
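One reading that would fit the ASan report, offered as an assumption and not confirmed anywhere in this report: operation entries are allocated from a per-endpoint pool but are also linked into a list that outlives the endpoint. If an entry is still linked when its pool is freed, a later dlist_remove on a neighbouring entry writes into the freed region, which would look like a WRITE in dlist_remove on one thread (T87) into memory freed by another (T86). The sketch below is a simplified model of that hazard and of the unlink-before-free ordering that avoids it. It is not libfabric code: struct op, struct domain, struct endpoint and the ep_* helpers are hypothetical, and the dlist_* functions are local stand-ins for an intrusive list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-in for an intrusive doubly linked list. */
struct dlist_entry {
    struct dlist_entry *next;
    struct dlist_entry *prev;
};

static void dlist_init(struct dlist_entry *head)
{
    head->next = head;
    head->prev = head;
}

static void dlist_insert_tail(struct dlist_entry *item, struct dlist_entry *head)
{
    item->next = head;
    item->prev = head->prev;
    head->prev->next = item;
    head->prev = item;
}

/* Writes into BOTH neighbours; if a neighbour lives in freed memory,
 * this is the kind of write a sanitizer reports. */
static void dlist_remove(struct dlist_entry *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
}

/* Hypothetical operation entry: stored in an endpoint-owned pool,
 * linked into a list owned by a longer-lived, shared object. */
struct op {
    int id;
    struct dlist_entry queued_entry;
};

struct domain {
    struct dlist_entry queued_list;
};

struct endpoint {
    struct op *pool;   /* one contiguous region */
    size_t pool_count;
};

static int ep_open(struct endpoint *ep, size_t count)
{
    ep->pool = calloc(count, sizeof(*ep->pool));
    ep->pool_count = ep->pool ? count : 0;
    return ep->pool ? 0 : -1;
}

static void ep_queue_all(struct endpoint *ep, struct domain *dom)
{
    for (size_t i = 0; i < ep->pool_count; i++)
        dlist_insert_tail(&ep->pool[i].queued_entry, &dom->queued_list);
}

/* True if the list entry lies inside this endpoint's pool region. */
static int in_pool(const struct endpoint *ep, const struct dlist_entry *e)
{
    uintptr_t p = (uintptr_t)e;
    uintptr_t lo = (uintptr_t)ep->pool;
    uintptr_t hi = lo + ep->pool_count * sizeof(*ep->pool);
    return p >= lo && p < hi;
}

/* Number of entries on the shared list that still point into ep's pool. */
static size_t domain_refs_into(const struct domain *dom, const struct endpoint *ep)
{
    size_t n = 0;
    for (const struct dlist_entry *e = dom->queued_list.next;
         e != &dom->queued_list; e = e->next)
        if (in_pool(ep, e))
            n++;
    return n;
}

/* Unlink every entry owned by ep from the shared list, then free the pool. */
static void ep_close(struct endpoint *ep, struct domain *dom)
{
    struct dlist_entry *e = dom->queued_list.next;
    while (e != &dom->queued_list) {
        struct dlist_entry *next = e->next; /* save before unlinking */
        if (in_pool(ep, e))
            dlist_remove(e);
        e = next;
    }
    assert(domain_refs_into(dom, ep) == 0); /* nothing may dangle after free */
    free(ep->pool);
    ep->pool = NULL;
    ep->pool_count = 0;
}
```

In this model, skipping the unlink loop in ep_close would leave the shared list pointing into freed memory, and the next dlist_remove on an adjacent entry from another endpoint would touch it. A check like domain_refs_into could serve as a debug assertion just before a pool is destroyed.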

For the 3rd scenario, the backtrace is as follows:

Thread 6 (Thread 0x77350b977000 (LWP 37360) "20000000-0000-0"):
#0  0x00007737e78ceb4c in efa_domain_progress_rdm_peers_and_queues (domain=0x516000019580) at prov/efa/src/efa_domain.c:901
#1  0x00007737e7901dc5 in efa_rdm_ep_wait_send (efa_rdm_ep=0x51a000254a80) at prov/efa/src/rdm/efa_rdm_ep_fiops.c:845
#2  efa_rdm_ep_close (fid=0x51a000254a80) at prov/efa/src/rdm/efa_rdm_ep_fiops.c:929
#3  0x0000571f6cc776d8 in fi_close (fid=<optimized out>) at /opt/grassvalley/dev/libfabric_2.2.0_linux/include/rdma/fabric.h:644
#4  Miranda::Mocha::FrameCache::FabricInterfaceDeleter<fid_ep>::operator() (in_pointer=<optimized out>, this=<optimized out>) at /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/RdmaUtils.h:29
#5  std::_Sp_counted_deleter<fid_ep*, Miranda::Mocha::FrameCache::FabricInterfaceDeleter<fid_ep>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:474
#6  0x0000571f6cbaf65a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5030004d35f0) at /usr/include/c++/10/bits/shared_ptr_base.h:158
#7  std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5030004d35f0) at /usr/include/c++/10/bits/shared_ptr_base.h:151
#8  0x0000571f6ccb4360 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x50c0001f4248, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:736
#9  std::__shared_ptr<fid_ep, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x50c0001f4240, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1188
#10 std::shared_ptr<fid_ep>::~shared_ptr (this=0x50c0001f4240, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr.h:121
#11 Miranda::Mocha::FrameCache::EfaEndpoint::~EfaEndpoint (this=0x50c0001f4210, __in_chrg=<optimized out>) at /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/EfaEndpoint.h:24
#12 __gnu_cxx::new_allocator<Miranda::Mocha::FrameCache::EfaEndpoint>::destroy<Miranda::Mocha::FrameCache::EfaEndpoint> (__p=0x50c0001f4210, this=0x50c0001f4210) at /usr/include/c++/10/ext/new_allocator.h:162
#13 std::allocator_traits<std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint> >::destroy<Miranda::Mocha::FrameCache::EfaEndpoint> (__p=0x50c0001f4210, __a=...) at /usr/include/c++/10/bits/alloc_traits.h:531
#14 std::_Sp_counted_ptr_inplace<Miranda::Mocha::FrameCache::EfaEndpoint, std::allocator<Miranda::Mocha::FrameCache::EfaEndpoint>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x50c0001f4200) at /usr/include/c++/10/bits/shared_ptr_base.h:560
#15 0x0000571f6cbaf65a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x50c0001f4200) at /usr/include/c++/10/bits/shared_ptr_base.h:158
#16 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x50c0001f4200) at /usr/include/c++/10/bits/shared_ptr_base.h:151
#17 0x0000571f6cce179f in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:736
#18 std::__shared_ptr<Miranda::Mocha::FrameCache::EfaEndpoint, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1188
#19 std::__shared_ptr<Miranda::Mocha::FrameCache::EfaEndpoint, (__gnu_cxx::_Lock_policy)2>::reset (this=<optimized out>) at /usr/include/c++/10/bits/shared_ptr_base.h:1306
#20 Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc()::{lambda()#1}::operator()() (__closure=<optimized out>) at /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:292
#21 boost::detail::function::void_function_obj_invoker0<Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc()::{lambda()#1}, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...) at /opt/grassvalley/dev/boost_1_74_0/boost/function/function_template.hpp:158
#22 0x0000571f6cce3f9c in boost::function0<void>::operator() (this=0x77350b9722f0) at /opt/grassvalley/dev/boost_1_74_0/boost/function/function_template.hpp:763
#23 boost::scope_exit::aux::guard<void>::~guard (this=0x77350b9722f0, __in_chrg=<optimized out>) at /opt/grassvalley/dev/boost_1_74_0/boost/scope_exit.hpp:714
#24 Miranda::Mocha::FrameCache::FlowOutputEfaV1::Peer::sendingThreadFunc (this=<optimized out>) at /var/ecs-builder/GVE/Mocha/native/MocFrameCache/src/FlowOutputEfaV1.cpp:287
#25 0x00007738119f6a22 in Miranda::Vy::Thread::threadEntryFunc(std::shared_ptr<Miranda::Vy::Thread::Context>&) () from /home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0
#26 0x00007738119f78ec in ?? () from /home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0
#27 0x0000773811a753c5 in ?? () from /home/ubuntu/framecache/libVyCore.so.Mocha.3.3.101.0
#28 0x0000773814c5ea42 in asan_thread_start (arg=0x77350b978000) at ../../../../src/libsanitizer/asan/asan_interceptors.cpp:234
#29 0x0000773810a9caa4 in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6
#30 0x0000773810b29c3c in ?? () from /usr/lib/x86_64-linux-gnu/libc.so.6

When I was able to break into this function, the only thing I noticed was that ope->window is 0. efa_outstanding_tx_ops for the stuck endpoint showed 23. I captured a perf recording that shows this function takes most of the CPU time (and doesn't let go):

[Image: perf recording]
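The spin is consistent with a drain loop whose exit condition depends on a counter that has stopped changing: if the outstanding-send count stays at 23 and no further completions arrive, a loop of the form "progress until the count reaches zero" never exits. The sketch below models only that shape and adds a poll bound so a stuck counter is reported instead of spinning. It is an illustration under stated assumptions, not the provider's implementation: struct fake_ep, fake_progress and drain_sends are hypothetical names, and the device is simulated by a simple counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical endpoint state: only what the drain loop looks at. */
struct fake_ep {
    size_t outstanding_tx_ops; /* sends posted but not yet completed */
    size_t completions_left;   /* completions the simulated device will still deliver */
};

/* Simulated progress call: reaps at most one completion per call. */
static void fake_progress(struct fake_ep *ep)
{
    if (ep->completions_left > 0 && ep->outstanding_tx_ops > 0) {
        ep->completions_left--;
        ep->outstanding_tx_ops--;
    }
}

/* Poll until no sends are outstanding, giving up after max_polls calls.
 * Returns 0 when drained, -1 when the counter is stuck. */
static int drain_sends(struct fake_ep *ep, size_t max_polls)
{
    assert(ep != NULL);
    for (size_t polls = 0; ep->outstanding_tx_ops > 0; polls++) {
        if (polls == max_polls)
            return -1; /* without this bound the loop would spin forever */
        fake_progress(ep);
    }
    return 0;
}
```

With 23 outstanding sends and 23 completions still to come, drain_sends returns 0. With 23 outstanding and none to come, it returns -1 and leaves the counter at 23, which is the state observed in the debugger.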

Environment:
Ubuntu 24.04, libfabric revision 0fd65de (origin/main from yesterday)
The applications run on c5n.9xlarge instances, which have the "older" EFA without RDMA. There are at least 20 endpoints, each carrying 2.2 Gbps, for a total of ~45 Gbps (out of 50 Gbps).
