### Describe the bug

The UCX `rc` and `cuda_ipc` transport modes are not working: with `UCX_TLS=rc,cuda_ipc`, `ucx_perftest` cannot allocate CUDA memory and fails with "Out of memory".

### Steps to Reproduce
- `ucx_info -v`:

```
# Library version: 1.19.0
# Library path: /usr/local/ucx/lib/libucs.so.0
# API headers version: 1.19.0
# Git branch 'v1.19.x', revision 71a4b63
# Configured with: --prefix=/usr/local/ucx --enable-shared --disable-static --disable-doxygen-doc --enable-optimizations --enable-cma --enable-devel-headers --with-cuda=/usr/local/cuda --with-verbs --with-dm --enable-mt
```
- test:

```shell
# server
UCX_LOG_LEVEL=debug UCX_TLS=rc,cuda_ipc ucx_perftest -c 0
# client
UCX_TLS=rc,cuda_ipc ucx_perftest 127.0.0.1 -t tag_bw -c 1 -s 1024 -m cuda
```
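The server log below shows `closing md cuda_cpy because it has no selected transport resources` right before the CUDA allocation failure, which suggests the restricted `UCX_TLS` list leaves no memory domain that can allocate CUDA buffers. A minimal sketch of that check (the `check_tls` helper is hypothetical, not a UCX tool, and `cuda_copy` being the required allocator is our assumption):

```shell
# Hypothetical helper (not part of UCX): flag a UCX_TLS value that
# restricts transports but omits cuda_copy, the md assumed to be
# needed for allocating CUDA buffers in ucx_perftest.
check_tls() {
    case ",$1," in
        *,cuda_copy,*|*,cuda,*|*,all,*) echo "ok: $1" ;;
        *) echo "warning: $1 selects no CUDA allocator (try adding cuda_copy)" ;;
    esac
}

check_tls "rc,cuda_ipc"            # the value used in this report
check_tls "rc,cuda_ipc,cuda_copy"  # a variant worth trying
```

If adding `cuda_copy` lets the allocation succeed, the failure is likely a transport-selection issue rather than a driver or hardware one.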
- log
server log:
[1750907562.762849] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] debug.c:1158 UCX DEBUG using signal stack 0x7ffbc7273000 size 141824
[1750907562.778816] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] init.c:121 UCX DEBUG /usr/local/ucx/lib/libucs.so.0 loaded at 0x7ffbc8852000
[1750907562.778850] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] init.c:122 UCX DEBUG cmd line: ucx_perftest -c 0
[1750907562.778861] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] module.c:72 UCX DEBUG ucs library path: /usr/local/ucx/lib/libucs.so.0
[1750907562.778866] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] module.c:304 UCX DEBUG loading modules for ucs
[1750907562.778914] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] module.c:304 UCX DEBUG loading modules for ucx_perftest
Waiting for connection...
Accepted connection from 127.0.0.1:22464
+----------------------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: tag match bandwidth |
| Data layout: (automatic) |
| Send memory: cuda |
| Recv memory: cuda |
| Message size: 1024 |
| Window size: 32 |
+----------------------------------------------------------------------------------------------------------+
[1750907576.341498] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] libperf.c:2153 UCX DEBUG set send allocator by send mem type cuda
[1750907576.341506] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] libperf.c:2157 UCX DEBUG set recv allocator by recv mem type cuda
[1750907578.121062] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] time.c:22 UCX DEBUG arch clock frequency: 2500000000.00 Hz
[1750907578.121262] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2339 UCX INFO Version 1.19.0 (loaded from /usr/local/ucx/lib/libucp.so.0)
[1750907578.121274] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2088 UCX DEBUG estimated number of endpoints is 1
[1750907578.121278] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2095 UCX DEBUG estimated number of endpoints per node is 1
[1750907578.121283] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2105 UCX DEBUG estimated bcopy bandwidth is 7340032000.000000
[1750907578.121292] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2164 UCX DEBUG allocation method[0] is md 'sysv'
[1750907578.121295] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2164 UCX DEBUG allocation method[1] is md 'posix'
[1750907578.121302] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2176 UCX DEBUG allocation method[2] is 'thp'
[1750907578.121305] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2164 UCX DEBUG allocation method[3] is md '*'
[1750907578.121308] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2176 UCX DEBUG allocation method[4] is 'mmap'
[1750907578.121311] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2176 UCX DEBUG allocation method[5] is 'heap'
[1750907578.121334] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] module.c:304 UCX DEBUG loading modules for uct
[1750907578.123945] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] module.c:304 UCX DEBUG loading modules for uct_cuda
[1750907578.124751] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] module.c:304 UCX DEBUG loading modules for uct_ib
[1750907578.125972] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:304 UCX DEBUG added sys_dev 0 for bus id 17:00.0
[1750907578.126026] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:304 UCX DEBUG added sys_dev 1 for bus id 1a:00.0
[1750907578.126275] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] sys.c:440 UCX DEBUG failed to open /proc/sys/kernel/yama/ptrace_scope: No such file or directory
[1750907578.126282] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] cma_md.c:69 UCX DEBUG could not read '/proc/sys/kernel/yama/ptrace_scope' - assuming Yama security is not enforced
[1750907578.126358] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md self because it has no selected transport resources
[1750907578.126638] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] tcp_iface.c:980 UCX DEBUG filtered out bridge device docker0
[1750907578.126865] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:935 UCX DEBUG /sys/class/net/lo: sysfs path undetected
[1750907578.126872] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:564 UCX DEBUG lo: system device unknown
[1750907578.127715] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:939 UCX DEBUG /sys/class/net/xgbe0: PF sysfs path is '/sys/devices/pci0000:54/0000:54:00.0/0000:55:00.0'
[1750907578.127799] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:304 UCX DEBUG added sys_dev 2 for bus id 55:00.0
[1750907578.127808] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:560 UCX DEBUG xgbe0: bdf_name 0000:55:00.0 sys_dev 2
[1750907578.129620] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md tcp because it has no selected transport resources
[1750907578.129708] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md sysv because it has no selected transport resources
[1750907578.129820] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md posix because it has no selected transport resources
[1750907578.129857] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] cuda_copy_md.c:111 UCX DEBUG dmabuf is not supported on cuda device 0
[1750907578.129904] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md cuda_cpy because it has no selected transport resources
[1750907578.129925] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] cuda_ipc_md.c:515 UCX DEBUG multi-node NVLINK support is disabled
[1750907578.136048] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:939 UCX DEBUG /sys/class/infiniband/mlx5_0: PF sysfs path is '/sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:12:10.0/0000:1c:00.0'
[1750907578.136099] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:304 UCX DEBUG added sys_dev 3 for bus id 1c:00.0
[1750907578.136105] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:560 UCX DEBUG mlx5_0: bdf_name 0000:1c:00.0 sys_dev 3
[1750907578.136135] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:533 UCX DEBUG mlx5_0: vendor_id 0x15b3 device_id 4119
[1750907578.136544] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5.h:1000 UCX DEBUG mlx5dv_devx_general_cmd(QUERY_HCA_CAP, CAP2) failed on mlx5_0, syndrome 0x5add95: Remote I/O error
[1750907578.136553] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1939 UCX DEBUG mlx5_0: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.136715] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1728 UCX DEBUG mlx5_0: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.136922] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:248 UCX DEBUG added async handler 0x18eb7b0 [id=35 ref 1] ???() to hash
[1750907578.137052] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:522 UCX DEBUG listening to async event fd 35 events 0x1 mode thread_spinlock
[1750907578.137062] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:645 UCX DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1750907578.137079] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_0: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.137090] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_0: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.137105] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:138 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.137236] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2445 UCX DEBUG mlx5_0: opened DEVX md log_max_qp=18
[1750907578.137572] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_0: KSM dm memory registration status "Success" range 0x7ffbbc05d000..0x7ffbbc05d020 iova 0x0 mkey_index 0x0
[1750907578.137674] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:115 UCX DEBUG mlx5_0: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc05d000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.137680] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_0: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc05d000 atomic mkey_index 0x0
[1750907578.137939] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1203 UCX DEBUG mlx5_0: relaxed order memory access is disabled
[1750907578.138109] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_0: KSM flush-mr memory registration status "Success" range 0x1948000..0x1948008 iova 0x0 mkey_index 0x0
[1750907578.138114] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2498 UCX DEBUG mlx5_0: XGVMI is not supported
[1750907578.138119] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1063 UCX DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.144564] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:939 UCX DEBUG /sys/class/infiniband/mlx5_1: PF sysfs path is '/sys/devices/pci0000:54/0000:54:00.0/0000:55:00.0'
[1750907578.144576] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:560 UCX DEBUG mlx5_1: bdf_name 0000:55:00.0 sys_dev 2
[1750907578.144594] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:533 UCX DEBUG mlx5_1: vendor_id 0x15b3 device_id 4121
[1750907578.145096] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1954 UCX DEBUG mlx5_1: mkey_by_name_reserve is not supported
[1750907578.145101] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1939 UCX DEBUG mlx5_1: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.145231] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1728 UCX DEBUG mlx5_1: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.145384] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:248 UCX DEBUG added async handler 0x194b010 [id=40 ref 1] ???() to hash
[1750907578.145397] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:522 UCX DEBUG listening to async event fd 40 events 0x1 mode thread_spinlock
[1750907578.145401] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:645 UCX DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1750907578.145410] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_1: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.145416] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_1: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.145420] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:138 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.145549] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2445 UCX DEBUG mlx5_1: opened DEVX md log_max_qp=18
[1750907578.145969] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_1: KSM dm memory registration status "Success" range 0x7ffbbc052000..0x7ffbbc052020 iova 0x0 mkey_index 0x0
[1750907578.146074] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:115 UCX DEBUG mlx5_1: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc052000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.146080] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_1: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc052000 atomic mkey_index 0x0
[1750907578.146426] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1203 UCX DEBUG mlx5_1: relaxed order memory access is disabled
[1750907578.146594] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_1: KSM flush-mr memory registration status "Success" range 0x194d000..0x194d008 iova 0x0 mkey_index 0x0
[1750907578.146602] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2498 UCX DEBUG mlx5_1: XGVMI is not supported
[1750907578.146605] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1063 UCX DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.147977] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md mlx5_1 because it has no selected transport resources
[1750907578.147986] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2535 UCX DEBUG mlx5_1: md=0x18f5f30 md->flags=0x3f101ab flush_rkey=0x6000
[1750907578.148161] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:194 UCX DEBUG mpool devx dbrec destroyed
[1750907578.148175] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:664 UCX DEBUG destroying ib device mlx5_1
[1750907578.148186] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:173 UCX DEBUG removed async handler 0x194b010 [id=40 ref 1] ???() from hash
[1750907578.148190] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:576 UCX DEBUG removing async handler 0x194b010 [id=40 ref 1] ???()
[1750907578.148195] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:188 UCX DEBUG release async handler 0x194b010 [id=40 ref 0] ???()
[1750907578.156608] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:939 UCX DEBUG /sys/class/infiniband/mlx5_2: PF sysfs path is '/sys/devices/pci0000:54/0000:54:00.0/0000:55:00.1'
[1750907578.156650] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:304 UCX DEBUG added sys_dev 4 for bus id 55:00.1
[1750907578.156654] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:560 UCX DEBUG mlx5_2: bdf_name 0000:55:00.1 sys_dev 4
[1750907578.156668] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:533 UCX DEBUG mlx5_2: vendor_id 0x15b3 device_id 4121
[1750907578.157162] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1954 UCX DEBUG mlx5_2: mkey_by_name_reserve is not supported
[1750907578.157168] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1939 UCX DEBUG mlx5_2: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.157301] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1728 UCX DEBUG mlx5_2: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.157453] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:248 UCX DEBUG added async handler 0x1947fa0 [id=40 ref 1] ???() to hash
[1750907578.157464] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:522 UCX DEBUG listening to async event fd 40 events 0x1 mode thread_spinlock
[1750907578.157468] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:645 UCX DEBUG initialized device 'mlx5_2' (InfiniBand channel adapter) with 1 ports
[1750907578.157477] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_2: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.157483] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_2: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.157488] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:138 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.157610] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2445 UCX DEBUG mlx5_2: opened DEVX md log_max_qp=18
[1750907578.157924] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_2: KSM dm memory registration status "Success" range 0x7ffbbc052000..0x7ffbbc052020 iova 0x0 mkey_index 0x0
[1750907578.158023] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:115 UCX DEBUG mlx5_2: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc052000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.158028] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_2: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc052000 atomic mkey_index 0x0
[1750907578.158249] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1203 UCX DEBUG mlx5_2: relaxed order memory access is disabled
[1750907578.158421] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_2: KSM flush-mr memory registration status "Success" range 0x19a4000..0x19a4008 iova 0x0 mkey_index 0x0
[1750907578.158425] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2498 UCX DEBUG mlx5_2: XGVMI is not supported
[1750907578.158429] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1063 UCX DEBUG mlx5_2: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.159247] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md mlx5_2 because it has no selected transport resources
[1750907578.159256] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2535 UCX DEBUG mlx5_2: md=0x194ba70 md->flags=0x3f101ab flush_rkey=0x44500
[1750907578.159419] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:194 UCX DEBUG mpool devx dbrec destroyed
[1750907578.159423] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:664 UCX DEBUG destroying ib device mlx5_2
[1750907578.159428] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:173 UCX DEBUG removed async handler 0x1947fa0 [id=40 ref 1] ???() from hash
[1750907578.159432] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:576 UCX DEBUG removing async handler 0x1947fa0 [id=40 ref 1] ???()
[1750907578.159437] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:188 UCX DEBUG release async handler 0x1947fa0 [id=40 ref 0] ???()
[1750907578.169482] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:939 UCX DEBUG /sys/class/infiniband/mlx5_3: PF sysfs path is '/sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:14.0/0000:70:00.0'
[1750907578.169528] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:304 UCX DEBUG added sys_dev 5 for bus id 70:00.0
[1750907578.169533] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] topo.c:560 UCX DEBUG mlx5_3: bdf_name 0000:70:00.0 sys_dev 5
[1750907578.169550] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:533 UCX DEBUG mlx5_3: vendor_id 0x15b3 device_id 4119
[1750907578.170421] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5.h:1000 UCX DEBUG mlx5dv_devx_general_cmd(QUERY_HCA_CAP, CAP2) failed on mlx5_3, syndrome 0x5add95: Remote I/O error
[1750907578.170430] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1939 UCX DEBUG mlx5_3: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.170613] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:1728 UCX DEBUG mlx5_3: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.170812] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:248 UCX DEBUG added async handler 0x19a7010 [id=40 ref 1] ???() to hash
[1750907578.170825] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:522 UCX DEBUG listening to async event fd 40 events 0x1 mode thread_spinlock
[1750907578.170829] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:645 UCX DEBUG initialized device 'mlx5_3' (InfiniBand channel adapter) with 1 ports
[1750907578.170842] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_3: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.170849] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1219 UCX DEBUG mlx5_3: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.170854] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:138 UCX DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.170993] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2445 UCX DEBUG mlx5_3: opened DEVX md log_max_qp=18
[1750907578.171337] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_3: KSM dm memory registration status "Success" range 0x7ffbbc052000..0x7ffbbc052020 iova 0x0 mkey_index 0x0
[1750907578.171423] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:115 UCX DEBUG mlx5_3: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc052000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.171429] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_3: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc052000 atomic mkey_index 0x0
[1750907578.171667] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1203 UCX DEBUG mlx5_3: relaxed order memory access is disabled
[1750907578.171830] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:145 UCX DEBUG mlx5_3: KSM flush-mr memory registration status "Success" range 0x199d000..0x199d008 iova 0x0 mkey_index 0x0
[1750907578.171834] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2498 UCX DEBUG mlx5_3: XGVMI is not supported
[1750907578.171838] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_md.c:1063 UCX DEBUG mlx5_3: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.172748] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md mlx5_3 because it has no selected transport resources
[1750907578.172758] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2535 UCX DEBUG mlx5_3: md=0x19a13b0 md->flags=0x3f1012f flush_rkey=0x4200
[1750907578.172932] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:194 UCX DEBUG mpool devx dbrec destroyed
[1750907578.172936] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:664 UCX DEBUG destroying ib device mlx5_3
[1750907578.172942] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:173 UCX DEBUG removed async handler 0x19a7010 [id=40 ref 1] ???() from hash
[1750907578.172946] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:576 UCX DEBUG removing async handler 0x19a7010 [id=40 ref 1] ???()
[1750907578.172953] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:188 UCX DEBUG release async handler 0x19a7010 [id=40 ref 0] ???()
[1750907578.173576] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] sys.c:440 UCX DEBUG failed to open /proc/sys/kernel/yama/ptrace_scope: No such file or directory
[1750907578.173580] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] cma_md.c:69 UCX DEBUG could not read '/proc/sys/kernel/yama/ptrace_scope' - assuming Yama security is not enforced
[1750907578.173619] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1662 UCX DEBUG closing md cma because it has no selected transport resources
[1750907578.173631] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1785 UCX DEBUG register host memory on: mlx5_0
[1750907578.173634] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1785 UCX DEBUG register cuda memory on: cuda_ipc, mlx5_0
[1750907578.173638] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1773 UCX DEBUG no memory domain supports registering cuda-managed memory
[1750907578.173641] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1773 UCX DEBUG no memory domain supports registering rocm memory
[1750907578.173644] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1773 UCX DEBUG no memory domain supports registering rocm-managed memory
[1750907578.173648] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1773 UCX DEBUG no memory domain supports registering rdma memory
[1750907578.173650] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1773 UCX DEBUG no memory domain supports registering ze-host memory
[1750907578.173651] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1773 UCX DEBUG no memory domain supports registering ze-device memory
[1750907578.173653] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:1773 UCX DEBUG no memory domain supports registering ze-managed memory
[1750907578.173700] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:138 UCX DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1750907578.173727] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:248 UCX DEBUG added async handler 0x18f2980 [id=39 ref 1] ???() to hash
[1750907578.173737] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:522 UCX DEBUG listening to async event fd 39 events 0x1 mode thread_spinlock
[1750907578.173852] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] module.c:304 UCX DEBUG loading modules for ucm
[1750907578.247965] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ucp_context.c:2415 UCX DEBUG created ucp context perftest 0x18d39e0 [2 mds 5 tls] features 0x3 tl bitmap 0x1f 0x0
[1750907578.248051] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] uct_mem.c:310 UCX DEBUG could not allocate user memory: cuda memory length=1024 flags=0x360 num_methods=1
[1750907578.248063] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] uct_mem.c:310 UCX DEBUG could not allocate user memory: cuda memory length=1024 flags=0x360 num_methods=1
[1750907578.248068] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] uct_mem.c:310 UCX DEBUG could not allocate user memory: cuda memory length=1024 flags=0x360 num_methods=1
[1750907578.248074] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] libperf.c:2064 UCX WARN ucp test failed to allocate memory
[1750907578.248112] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:173 UCX DEBUG removed async handler 0x18f2980 [id=39 ref 1] ???() from hash
[1750907578.248118] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:576 UCX DEBUG removing async handler 0x18f2980 [id=39 ref 1] ???()
[1750907578.248128] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:188 UCX DEBUG release async handler 0x18f2980 [id=39 ref 0] ???()
[1750907578.248158] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] pgtable.c:618 UCX DEBUG purge empty page table
[1750907578.248165] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:194 UCX DEBUG mpool rcache_mp destroyed
[1750907578.248190] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_mlx5dv_md.c:2535 UCX DEBUG mlx5_0: md=0x18f5220 md->flags=0x3f1012f flush_rkey=0x1fca00
[1750907578.248439] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] mpool.c:194 UCX DEBUG mpool devx dbrec destroyed
[1750907578.248448] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] ib_device.c:664 UCX DEBUG destroying ib device mlx5_0
[1750907578.248453] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:173 UCX DEBUG removed async handler 0x18eb7b0 [id=35 ref 1] ???() from hash
[1750907578.248457] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:576 UCX DEBUG removing async handler 0x18eb7b0 [id=35 ref 1] ???()
[1750907578.248556] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] async.c:188 UCX DEBUG release async handler 0x18eb7b0 [id=35 ref 0] ???()
[1750907578.249234] [bddwd-inf-k8s-a100-ab2-0014:7568 :0] perftest_run.c:345 UCX ERROR Failed to run test: Out of memory
client log:
UCX_TLS=rc ucx_perftest 127.0.0.1 -t tag_lat -c 1 -s 809600000 -m cuda
+--------------+--------------+------------------------------+---------------------+-----------------------+
| | | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
| Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1750906436.727419] [bddwd-inf-k8s-a100-ab2-0014:160066:0] libperf.c:2064 UCX WARN ucp test failed to allocate memory
[1750906436.730516] [bddwd-inf-k8s-a100-ab2-0014:160066:0] perftest_run.c:345 UCX ERROR Failed to run test: Out of memory
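Both runs die in the same place (`libperf.c:2064 ... failed to allocate memory`). A quick triage sketch for a debug log like the server one above — the file name and the four-line heredoc excerpt are illustrative, copied from that log:

```shell
# Reduce a UCX debug log to the md-selection and allocation events.
# The heredoc holds a four-line excerpt of the server log above.
cat > /tmp/ucx_server.log <<'EOF'
ucp_context.c:1662 UCX DEBUG closing md cuda_cpy because it has no selected transport resources
ucp_context.c:1785 UCX DEBUG register cuda memory on: cuda_ipc, mlx5_0
uct_mem.c:310 UCX DEBUG could not allocate user memory: cuda memory length=1024 flags=0x360 num_methods=1
libperf.c:2064 UCX WARN ucp test failed to allocate memory
EOF

grep -E 'closing md|register .* memory on|allocate' /tmp/ucx_server.log
```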
### Setup and versions
- OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...):
  - Ubuntu 18.04.5 LTS
  - `Linux bddwd-inf-k8s-a100-ab2-0014.bddwd.com 5.10.0-1.0.0.28 #1 SMP Mon Jun 5 02:20:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux`
- For RDMA/IB/RoCE related issues:
  - Driver version: `ofed_info -s` reports MLNX_OFED_LINUX-5.0-2.1.8.0
  - `ibv_devinfo -vv`:
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.28.1002
node_guid: 08c0:eb03:00dd:ee46
sys_image_guid: 08c0:eb03:00dd:ee46
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000010
phys_port_cnt: 1
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffffffffff000
max_qp: 262144
max_qp_wr: 32768
device_cap_flags: 0xe97e1c36
BAD_PKEY_CNTR
BAD_QKEY_CNTR
AUTO_PATH_MIG
CHANGE_PHY_PORT
PORT_ACTIVE_EVENT
SYS_IMAGE_GUID
RC_RNR_NAK_GEN
MEM_WINDOW
UD_IP_CSUM
XRC
MEM_MGT_EXTENSIONS
MEM_WINDOW_TYPE_2B
MANAGED_FLOW_STEERING
Unknown flags: 0xC8480000
max_sge: 30
max_sge_rd: 30
max_cq: 16777216
max_cqe: 4194303
max_mr: 16777216
max_pd: 16777216
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom: 4194304
max_qp_init_rd_atom: 16
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd: 0
max_mw: 16777216
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 2097152
max_mcast_qp_attach: 240
max_total_mcast_qp_attach: 503316480
max_ah: 2147483647
max_fmr: 0
max_srq: 8388608
max_srq_wr: 32767
max_srq_sge: 31
max_pkeys: 128
local_ca_ack_delay: 16
general_odp_caps:
ODP_SUPPORT
ODP_SUPPORT_IMPLICIT
rc_odp_caps:
SUPPORT_SEND
SUPPORT_RECV
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
SUPPORT_SEND
xrc_odp_caps:
SUPPORT_SEND
SUPPORT_WRITE
SUPPORT_READ
SUPPORT_SRQ
completion timestamp_mask: 0x7fffffffffffffff
hca_core_clock: 78125kHZ
device_cap_flags_ex: 0x30000051E97E1C36
PCI_WRITE_END_PADDING
Unknown flags: 0x3000004100000000
tso_caps:
max_tso: 0
rss_caps:
max_rwq_indirection_tables: 0
max_rwq_indirection_table_size: 0
rx_hash_function: 0x0
rx_hash_fields_mask: 0x0
max_wq_type_rq: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
max_rndv_hdr_size: 64
max_num_tags: 127
max_ops: 32768
max_sge: 1
flags:
IBV_TM_CAP_RC
- For GPU related issues:
  - GPU type: NVIDIA A100-SXM4-40GB (see `nvidia-smi` output below)
  - CUDA: `nvcc --version`:

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```

  - Check if peer-direct is loaded (`lsmod | grep nv_peer_mem`) and/or gdrcopy (`lsmod | grep gdrdrv`): no output
### Additional information

- `nvidia-smi nvlink --status`:
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-0c15878b-9b12-3744-97ed-e25adeac57c3)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-36ea1ca4-4d0d-e206-5795-ab3c326154ef)
Link 0: 25 GB/s
Link 1: 25 GB/s
Link 2: 25 GB/s
Link 3: 25 GB/s
Link 4: 25 GB/s
Link 5: 25 GB/s
Link 6: 25 GB/s
Link 7: 25 GB/s
Link 8: 25 GB/s
Link 9: 25 GB/s
Link 10: 25 GB/s
Link 11: 25 GB/s

- `p2pBandwidthLatencyTest`:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100-SXM4-40GB, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA A100-SXM4-40GB, pciBusID: 1a, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1212.18 9.58
1 8.30 1302.08
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 1287.07 272.76
1 276.54 1298.84
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1308.63 10.13
1 10.07 625.25
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1299.92 326.65
1 472.64 627.51
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 2.73 27.77
1 28.56 2.21
CPU 0 1
0 3.23 9.78
1 11.16 3.16
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 2.72 3.46
1 3.54 2.21
CPU 0 1
0 3.10 2.58
1 2.63 4.68