Open
Description
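After creating endpoints, exchanging addresses, registering memory with FI_REMOTE_READ access, and exchanging raw MR keys, the application that issues the read crashes while polling the completion queue (fi_cq_read) with the following backtrace.
Using libfabric version 2.3 on MLX NDR200 NICs. According to the source paths in the backtrace, the UCX build is 1.15.0.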
Backtrace:
#0 0x00007ff11ca4854c in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007ff11c9fbd46 in raise () from /lib64/libc.so.6
#2 0x00007ff11c9cf7f3 in abort () from /lib64/libc.so.6
#3 0x00007ff11eb889d1 in ucs_fatal_error_message (file=0x243b8b <error: Cannot access memory at address 0x243b8b>, line=2382514, function=0x6 <error: Cannot access memory at address 0x6>, message_buf=0x7ff11ca4854c <__pthread_kill_implementation+284> "\211\305\367\335=") at debug/assert.c:38
#4 0x00007ff11eb8e12c in ucs_log_default_handler (file=0x243b8b <error: Cannot access memory at address 0x243b8b>, line=2382514, function=0x6 <error: Cannot access memory at address 0x6>, level=480544076, comp_conf=0x7ff0986055e0, format=0x12 <error: Cannot access memory at address 0x12>, ap=0x7ff098607d00) at debug/log.c:317
#5 0x00007ff11eb8d15c in ucs_log_dispatch (file=0x243b8b <error: Cannot access memory at address 0x243b8b>, line=2382514, function=0x6 <error: Cannot access memory at address 0x6>, level=480544076, comp_conf=0x7ff0986055e0, format=0x12 <error: Cannot access memory at address 0x12>) at debug/log.c:381
#6 0x00007ff015ca7ea0 in uct_ib_mlx5_completion_with_err (iface=0x243b8b, ecqe=0x245ab2, txwq=0x6, log_level=480544076) at mlx5/ib_mlx5_log.c:171
#7 0x00007ff015cd9e61 in uct_rc_mlx5_iface_handle_failure (ib_iface=0x243b8b, arg=0x245ab2, ep_status=6) at rc/accel/rc_mlx5_iface.c:299
#8 0x00007ff015caa5b2 in uct_ib_mlx5_check_completion_with_err (iface=0x243b8b, cq=0x245ab2, cqe=0x6) at mlx5/ib_mlx5.c:496
#9 0x00007ff015ca8cab in uct_ib_mlx5_check_completion (iface=0x243b8b, cq=0x245ab2, cqe=0x6, flags=480544076) at mlx5/ib_mlx5.c:477
#10 0x00007ff015cd3445 in uct_ib_mlx5_poll_cq (iface=<optimized out>, cq=0x62d000b98bf8, poll_flags=<optimized out>, check_cqe_cb=<optimized out>) at /apps/GPP/UCX/SRC/ucx-1.15.0/src/uct/ib/mlx5/ib_mlx5.inl:146
#11 uct_rc_mlx5_iface_poll_tx (iface=<optimized out>, poll_flags=<optimized out>) at rc/accel/rc_mlx5_iface.c:149
#12 uct_rc_mlx5_iface_progress (arg=<optimized out>, flags=<optimized out>) at rc/accel/rc_mlx5_iface.c:189
#13 uct_rc_mlx5_iface_progress_cyclic (arg=0x243b8b) at rc/accel/rc_mlx5_iface.c:194
#14 0x00007ff11ecc7158 in ucs_callbackq_dispatch (cbq=0x6) at /apps/GPP/UCX/SRC/ucx-1.15.0/src/ucs/datastruct/callbackq.h:211
#15 uct_worker_progress (worker=<optimized out>) at /apps/GPP/UCX/SRC/ucx-1.15.0/src/uct/api/uct.h:2777
#16 ucp_worker_progress (worker=0x243b8b) at core/ucp_worker.c:2889
#17 0x00007ff11f1757c0 in ucx_ep_progress () from /gpfs/scratch/ehpc01/de340/maestro/sep09/install/lib/libfabric.so.1
#18 0x00007ff11ef960b4 in ofi_cq_progress () from /gpfs/scratch/ehpc01/de340/maestro/sep09/install/lib/libfabric.so.1
#19 0x00007ff11ef945f5 in ofi_cq_readfrom () from /gpfs/scratch/ehpc01/de340/maestro/sep09/install/lib/libfabric.so.1
#20 0x00007ff11f44c5ba in fi_cq_read (cq=0x243b8b, buf=0x245ab2, count=1) at /gpfs/scratch/ehpc01/de340/maestro/sep09/install/include/rdma/fi_eq.h:400
#21 mstro_ofi__check_and_handle_cq (ep=0x620000000098, incoming_msg_handler=<optimized out>) at ofi.c:2917