The UCX rc and cuda_ipc transport modes are not working. #10743

Open
@tienfeek

Describe the bug

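ucx_perftest with CUDA memory (-m cuda) fails when UCX_TLS is restricted to rc,cuda_ipc: the test cannot allocate memory and exits with "Failed to run test: Out of memory".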

Steps to Reproduce

  • ucx_info -v
# Library version: 1.19.0
# Library path: /usr/local/ucx/lib/libucs.so.0
# API headers version: 1.19.0
# Git branch 'v1.19.x', revision 71a4b63
# Configured with: --prefix=/usr/local/ucx --enable-shared --disable-static --disable-doxygen-doc --enable-optimizations --enable-cma --enable-devel-headers --with-cuda=/usr/local/cuda --with-verbs --with-dm --enable-mt
  • test
  server: UCX_LOG_LEVEL=debug UCX_TLS=rc,cuda_ipc ucx_perftest -c 0
  client: UCX_TLS=rc,cuda_ipc ucx_perftest 127.0.0.1 -t tag_bw -c 1 -s 1024 -m cuda
  • log
    server log:
[1750907562.762849] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           debug.c:1158 UCX  DEBUG using signal stack 0x7ffbc7273000 size 141824
[1750907562.778816] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            init.c:121  UCX  DEBUG /usr/local/ucx/lib/libucs.so.0 loaded at 0x7ffbc8852000
[1750907562.778850] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            init.c:122  UCX  DEBUG cmd line: ucx_perftest -c 0 
[1750907562.778861] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          module.c:72   UCX  DEBUG ucs library path: /usr/local/ucx/lib/libucs.so.0
[1750907562.778866] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          module.c:304  UCX  DEBUG loading modules for ucs
[1750907562.778914] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          module.c:304  UCX  DEBUG loading modules for ucx_perftest
Waiting for connection...
Accepted connection from 127.0.0.1:22464
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                             |
| Test:         tag match bandwidth                                                                        |
| Data layout:  (automatic)                                                                                |
| Send memory:  cuda                                                                                       |
| Recv memory:  cuda                                                                                       |
| Message size: 1024                                                                                       |
| Window size:  32                                                                                         |
+----------------------------------------------------------------------------------------------------------+
[1750907576.341498] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         libperf.c:2153 UCX  DEBUG set send allocator by send mem type cuda
[1750907576.341506] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         libperf.c:2157 UCX  DEBUG set recv allocator by recv mem type cuda
[1750907578.121062] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            time.c:22   UCX  DEBUG arch clock frequency: 2500000000.00 Hz
[1750907578.121262] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2339 UCX  INFO  Version 1.19.0 (loaded from /usr/local/ucx/lib/libucp.so.0)
[1750907578.121274] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2088 UCX  DEBUG estimated number of endpoints is 1
[1750907578.121278] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2095 UCX  DEBUG estimated number of endpoints per node is 1
[1750907578.121283] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2105 UCX  DEBUG estimated bcopy bandwidth is 7340032000.000000
[1750907578.121292] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2164 UCX  DEBUG allocation method[0] is md 'sysv'
[1750907578.121295] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2164 UCX  DEBUG allocation method[1] is md 'posix'
[1750907578.121302] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2176 UCX  DEBUG allocation method[2] is 'thp'
[1750907578.121305] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2164 UCX  DEBUG allocation method[3] is md '*'
[1750907578.121308] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2176 UCX  DEBUG allocation method[4] is 'mmap'
[1750907578.121311] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2176 UCX  DEBUG allocation method[5] is 'heap'
[1750907578.121334] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          module.c:304  UCX  DEBUG loading modules for uct
[1750907578.123945] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          module.c:304  UCX  DEBUG loading modules for uct_cuda
[1750907578.124751] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          module.c:304  UCX  DEBUG loading modules for uct_ib
[1750907578.125972] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:304  UCX  DEBUG added sys_dev 0 for bus id 17:00.0
[1750907578.126026] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:304  UCX  DEBUG added sys_dev 1 for bus id 1a:00.0
[1750907578.126275] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]             sys.c:440  UCX  DEBUG failed to open /proc/sys/kernel/yama/ptrace_scope: No such file or directory
[1750907578.126282] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          cma_md.c:69   UCX  DEBUG could not read '/proc/sys/kernel/yama/ptrace_scope' - assuming Yama security is not enforced
[1750907578.126358] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md self because it has no selected transport resources
[1750907578.126638] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       tcp_iface.c:980  UCX  DEBUG filtered out bridge device docker0
[1750907578.126865] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:935  UCX  DEBUG /sys/class/net/lo: sysfs path undetected
[1750907578.126872] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:564  UCX  DEBUG lo: system device unknown
[1750907578.127715] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:939  UCX  DEBUG /sys/class/net/xgbe0: PF sysfs path is '/sys/devices/pci0000:54/0000:54:00.0/0000:55:00.0'
[1750907578.127799] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:304  UCX  DEBUG added sys_dev 2 for bus id 55:00.0
[1750907578.127808] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:560  UCX  DEBUG xgbe0: bdf_name 0000:55:00.0 sys_dev 2
[1750907578.129620] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md tcp because it has no selected transport resources
[1750907578.129708] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md sysv because it has no selected transport resources
[1750907578.129820] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md posix because it has no selected transport resources
[1750907578.129857] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    cuda_copy_md.c:111  UCX  DEBUG dmabuf is not supported on cuda device 0
[1750907578.129904] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md cuda_cpy because it has no selected transport resources
[1750907578.129925] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     cuda_ipc_md.c:515  UCX  DEBUG multi-node NVLINK support is disabled
[1750907578.136048] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:939  UCX  DEBUG /sys/class/infiniband/mlx5_0: PF sysfs path is '/sys/devices/pci0000:10/0000:10:00.0/0000:11:00.0/0000:12:10.0/0000:1c:00.0'
[1750907578.136099] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:304  UCX  DEBUG added sys_dev 3 for bus id 1c:00.0
[1750907578.136105] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:560  UCX  DEBUG mlx5_0: bdf_name 0000:1c:00.0 sys_dev 3
[1750907578.136135] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:533  UCX  DEBUG mlx5_0: vendor_id 0x15b3 device_id 4119
[1750907578.136544] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         ib_mlx5.h:1000 UCX  DEBUG mlx5dv_devx_general_cmd(QUERY_HCA_CAP, CAP2) failed on mlx5_0, syndrome 0x5add95: Remote I/O error
[1750907578.136553] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1939 UCX  DEBUG mlx5_0: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.136715] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1728 UCX  DEBUG mlx5_0: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.136922] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:248  UCX  DEBUG added async handler 0x18eb7b0 [id=35 ref 1] ???() to hash
[1750907578.137052] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:522  UCX  DEBUG listening to async event fd 35 events 0x1 mode thread_spinlock
[1750907578.137062] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:645  UCX  DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1750907578.137079] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.137090] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.137105] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:138  UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.137236] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2445 UCX  DEBUG mlx5_0: opened DEVX md log_max_qp=18
[1750907578.137572] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_0: KSM dm memory registration status "Success" range 0x7ffbbc05d000..0x7ffbbc05d020 iova 0x0 mkey_index 0x0
[1750907578.137674] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:115  UCX  DEBUG mlx5_0: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc05d000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.137680] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_0: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc05d000 atomic mkey_index 0x0
[1750907578.137939] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1203 UCX  DEBUG mlx5_0: relaxed order memory access is disabled
[1750907578.138109] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_0: KSM flush-mr memory registration status "Success" range 0x1948000..0x1948008 iova 0x0 mkey_index 0x0
[1750907578.138114] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2498 UCX  DEBUG mlx5_0: XGVMI is not supported
[1750907578.138119] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1063 UCX  DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.144564] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:939  UCX  DEBUG /sys/class/infiniband/mlx5_1: PF sysfs path is '/sys/devices/pci0000:54/0000:54:00.0/0000:55:00.0'
[1750907578.144576] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:560  UCX  DEBUG mlx5_1: bdf_name 0000:55:00.0 sys_dev 2
[1750907578.144594] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:533  UCX  DEBUG mlx5_1: vendor_id 0x15b3 device_id 4121
[1750907578.145096] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1954 UCX  DEBUG mlx5_1: mkey_by_name_reserve is not supported
[1750907578.145101] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1939 UCX  DEBUG mlx5_1: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.145231] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1728 UCX  DEBUG mlx5_1: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.145384] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:248  UCX  DEBUG added async handler 0x194b010 [id=40 ref 1] ???() to hash
[1750907578.145397] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:522  UCX  DEBUG listening to async event fd 40 events 0x1 mode thread_spinlock
[1750907578.145401] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:645  UCX  DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1750907578.145410] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_1: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.145416] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_1: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.145420] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:138  UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.145549] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2445 UCX  DEBUG mlx5_1: opened DEVX md log_max_qp=18
[1750907578.145969] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_1: KSM dm memory registration status "Success" range 0x7ffbbc052000..0x7ffbbc052020 iova 0x0 mkey_index 0x0
[1750907578.146074] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:115  UCX  DEBUG mlx5_1: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc052000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.146080] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_1: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc052000 atomic mkey_index 0x0
[1750907578.146426] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1203 UCX  DEBUG mlx5_1: relaxed order memory access is disabled
[1750907578.146594] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_1: KSM flush-mr memory registration status "Success" range 0x194d000..0x194d008 iova 0x0 mkey_index 0x0
[1750907578.146602] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2498 UCX  DEBUG mlx5_1: XGVMI is not supported
[1750907578.146605] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1063 UCX  DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.147977] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md mlx5_1 because it has no selected transport resources
[1750907578.147986] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2535 UCX  DEBUG mlx5_1: md=0x18f5f30 md->flags=0x3f101ab flush_rkey=0x6000
[1750907578.148161] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:194  UCX  DEBUG mpool devx dbrec destroyed
[1750907578.148175] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:664  UCX  DEBUG destroying ib device mlx5_1
[1750907578.148186] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:173  UCX  DEBUG removed async handler 0x194b010 [id=40 ref 1] ???() from hash
[1750907578.148190] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:576  UCX  DEBUG removing async handler 0x194b010 [id=40 ref 1] ???()
[1750907578.148195] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:188  UCX  DEBUG release async handler 0x194b010 [id=40 ref 0] ???()
[1750907578.156608] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:939  UCX  DEBUG /sys/class/infiniband/mlx5_2: PF sysfs path is '/sys/devices/pci0000:54/0000:54:00.0/0000:55:00.1'
[1750907578.156650] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:304  UCX  DEBUG added sys_dev 4 for bus id 55:00.1
[1750907578.156654] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:560  UCX  DEBUG mlx5_2: bdf_name 0000:55:00.1 sys_dev 4
[1750907578.156668] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:533  UCX  DEBUG mlx5_2: vendor_id 0x15b3 device_id 4121
[1750907578.157162] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1954 UCX  DEBUG mlx5_2: mkey_by_name_reserve is not supported
[1750907578.157168] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1939 UCX  DEBUG mlx5_2: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.157301] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1728 UCX  DEBUG mlx5_2: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.157453] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:248  UCX  DEBUG added async handler 0x1947fa0 [id=40 ref 1] ???() to hash
[1750907578.157464] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:522  UCX  DEBUG listening to async event fd 40 events 0x1 mode thread_spinlock
[1750907578.157468] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:645  UCX  DEBUG initialized device 'mlx5_2' (InfiniBand channel adapter) with 1 ports
[1750907578.157477] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_2: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.157483] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_2: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.157488] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:138  UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.157610] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2445 UCX  DEBUG mlx5_2: opened DEVX md log_max_qp=18
[1750907578.157924] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_2: KSM dm memory registration status "Success" range 0x7ffbbc052000..0x7ffbbc052020 iova 0x0 mkey_index 0x0
[1750907578.158023] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:115  UCX  DEBUG mlx5_2: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc052000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.158028] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_2: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc052000 atomic mkey_index 0x0
[1750907578.158249] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1203 UCX  DEBUG mlx5_2: relaxed order memory access is disabled
[1750907578.158421] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_2: KSM flush-mr memory registration status "Success" range 0x19a4000..0x19a4008 iova 0x0 mkey_index 0x0
[1750907578.158425] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2498 UCX  DEBUG mlx5_2: XGVMI is not supported
[1750907578.158429] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1063 UCX  DEBUG mlx5_2: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.159247] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md mlx5_2 because it has no selected transport resources
[1750907578.159256] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2535 UCX  DEBUG mlx5_2: md=0x194ba70 md->flags=0x3f101ab flush_rkey=0x44500
[1750907578.159419] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:194  UCX  DEBUG mpool devx dbrec destroyed
[1750907578.159423] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:664  UCX  DEBUG destroying ib device mlx5_2
[1750907578.159428] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:173  UCX  DEBUG removed async handler 0x1947fa0 [id=40 ref 1] ???() from hash
[1750907578.159432] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:576  UCX  DEBUG removing async handler 0x1947fa0 [id=40 ref 1] ???()
[1750907578.159437] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:188  UCX  DEBUG release async handler 0x1947fa0 [id=40 ref 0] ???()
[1750907578.169482] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:939  UCX  DEBUG /sys/class/infiniband/mlx5_3: PF sysfs path is '/sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0/0000:66:14.0/0000:70:00.0'
[1750907578.169528] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:304  UCX  DEBUG added sys_dev 5 for bus id 70:00.0
[1750907578.169533] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]            topo.c:560  UCX  DEBUG mlx5_3: bdf_name 0000:70:00.0 sys_dev 5
[1750907578.169550] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:533  UCX  DEBUG mlx5_3: vendor_id 0x15b3 device_id 4119
[1750907578.170421] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         ib_mlx5.h:1000 UCX  DEBUG mlx5dv_devx_general_cmd(QUERY_HCA_CAP, CAP2) failed on mlx5_3, syndrome 0x5add95: Remote I/O error
[1750907578.170430] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1939 UCX  DEBUG mlx5_3: dp_ordering support: force=0 ooo_rw_rc=1 ooo_rw_dc=1
[1750907578.170613] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:1728 UCX  DEBUG mlx5_3: ODP is disabled because version 1 is not supported for DevX QP
[1750907578.170812] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:248  UCX  DEBUG added async handler 0x19a7010 [id=40 ref 1] ???() to hash
[1750907578.170825] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:522  UCX  DEBUG listening to async event fd 40 events 0x1 mode thread_spinlock
[1750907578.170829] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:645  UCX  DEBUG initialized device 'mlx5_3' (InfiniBand channel adapter) with 1 ports
[1750907578.170842] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_3: cuda GPUDirect RDMA is detected by checking /sys/kernel/mm/memory_peers/nv_mem/version
[1750907578.170849] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1219 UCX  DEBUG mlx5_3: rocm GPUDirect RDMA is not detected by checking /dev/kfd
[1750907578.170854] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:138  UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1750907578.170993] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2445 UCX  DEBUG mlx5_3: opened DEVX md log_max_qp=18
[1750907578.171337] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_3: KSM dm memory registration status "Success" range 0x7ffbbc052000..0x7ffbbc052020 iova 0x0 mkey_index 0x0
[1750907578.171423] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:115  UCX  DEBUG mlx5_3: mlx5dv_devx_obj_create(CREATE_MKEY, mode=KSM, start_addr=0x7ffbbc052000 length=32) failed, syndrome 0xee806c: Remote I/O error
[1750907578.171429] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_3: KSM atomic-key memory registration status "Unsupported operation" range (nil)..0x20 iova 0x7ffbbc052000 atomic mkey_index 0x0
[1750907578.171667] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1203 UCX  DEBUG mlx5_3: relaxed order memory access is disabled
[1750907578.171830] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:145  UCX  DEBUG mlx5_3: KSM flush-mr memory registration status "Success" range 0x199d000..0x199d008 iova 0x0 mkey_index 0x0
[1750907578.171834] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2498 UCX  DEBUG mlx5_3: XGVMI is not supported
[1750907578.171838] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           ib_md.c:1063 UCX  DEBUG mlx5_3: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1750907578.172748] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md mlx5_3 because it has no selected transport resources
[1750907578.172758] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2535 UCX  DEBUG mlx5_3: md=0x19a13b0 md->flags=0x3f1012f flush_rkey=0x4200
[1750907578.172932] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:194  UCX  DEBUG mpool devx dbrec destroyed
[1750907578.172936] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:664  UCX  DEBUG destroying ib device mlx5_3
[1750907578.172942] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:173  UCX  DEBUG removed async handler 0x19a7010 [id=40 ref 1] ???() from hash
[1750907578.172946] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:576  UCX  DEBUG removing async handler 0x19a7010 [id=40 ref 1] ???()
[1750907578.172953] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:188  UCX  DEBUG release async handler 0x19a7010 [id=40 ref 0] ???()
[1750907578.173576] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]             sys.c:440  UCX  DEBUG failed to open /proc/sys/kernel/yama/ptrace_scope: No such file or directory
[1750907578.173580] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          cma_md.c:69   UCX  DEBUG could not read '/proc/sys/kernel/yama/ptrace_scope' - assuming Yama security is not enforced
[1750907578.173619] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1662 UCX  DEBUG closing md cma because it has no selected transport resources
[1750907578.173631] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1785 UCX  DEBUG register host memory on: mlx5_0
[1750907578.173634] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1785 UCX  DEBUG register cuda memory on: cuda_ipc, mlx5_0
[1750907578.173638] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1773 UCX  DEBUG no memory domain supports registering cuda-managed memory
[1750907578.173641] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1773 UCX  DEBUG no memory domain supports registering rocm memory
[1750907578.173644] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1773 UCX  DEBUG no memory domain supports registering rocm-managed memory
[1750907578.173648] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1773 UCX  DEBUG no memory domain supports registering rdma memory
[1750907578.173650] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1773 UCX  DEBUG no memory domain supports registering ze-host memory
[1750907578.173651] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1773 UCX  DEBUG no memory domain supports registering ze-device memory
[1750907578.173653] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:1773 UCX  DEBUG no memory domain supports registering ze-managed memory
[1750907578.173700] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:138  UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1750907578.173727] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:248  UCX  DEBUG added async handler 0x18f2980 [id=39 ref 1] ???() to hash
[1750907578.173737] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:522  UCX  DEBUG listening to async event fd 39 events 0x1 mode thread_spinlock
[1750907578.173852] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]          module.c:304  UCX  DEBUG loading modules for ucm
[1750907578.247965] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]     ucp_context.c:2415 UCX  DEBUG created ucp context perftest 0x18d39e0 [2 mds 5 tls] features 0x3 tl bitmap 0x1f 0x0
[1750907578.248051] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         uct_mem.c:310  UCX  DEBUG   could not allocate user memory: cuda memory length=1024 flags=0x360 num_methods=1
[1750907578.248063] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         uct_mem.c:310  UCX  DEBUG   could not allocate user memory: cuda memory length=1024 flags=0x360 num_methods=1
[1750907578.248068] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         uct_mem.c:310  UCX  DEBUG   could not allocate user memory: cuda memory length=1024 flags=0x360 num_methods=1
[1750907578.248074] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         libperf.c:2064 UCX  WARN  ucp test failed to allocate memory
[1750907578.248112] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:173  UCX  DEBUG removed async handler 0x18f2980 [id=39 ref 1] ???() from hash
[1750907578.248118] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:576  UCX  DEBUG removing async handler 0x18f2980 [id=39 ref 1] ???()
[1750907578.248128] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:188  UCX  DEBUG release async handler 0x18f2980 [id=39 ref 0] ???()
[1750907578.248158] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]         pgtable.c:618  UCX  DEBUG purge empty page table
[1750907578.248165] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:194  UCX  DEBUG mpool rcache_mp destroyed
[1750907578.248190] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    ib_mlx5dv_md.c:2535 UCX  DEBUG mlx5_0: md=0x18f5220 md->flags=0x3f1012f flush_rkey=0x1fca00
[1750907578.248439] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           mpool.c:194  UCX  DEBUG mpool devx dbrec destroyed
[1750907578.248448] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]       ib_device.c:664  UCX  DEBUG destroying ib device mlx5_0
[1750907578.248453] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:173  UCX  DEBUG removed async handler 0x18eb7b0 [id=35 ref 1] ???() from hash
[1750907578.248457] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:576  UCX  DEBUG removing async handler 0x18eb7b0 [id=35 ref 1] ???()
[1750907578.248556] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]           async.c:188  UCX  DEBUG release async handler 0x18eb7b0 [id=35 ref 0] ???()
[1750907578.249234] [bddwd-inf-k8s-a100-ab2-0014:7568 :0]    perftest_run.c:345  UCX  ERROR Failed to run test: Out of memory
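
The server log above is long, so a small filter can help isolate the lines relevant to this failure. The sketch below lists the memory domains UCX closed and counts the allocation failures. The message patterns are copied from the log above; the function name summarize_ucx_log and the file name server.log are hypothetical.

```shell
#!/bin/sh
# Summarize a saved UCX debug log (sketch; names are illustrative).
summarize_ucx_log() {
    log_file="$1"

    echo "Closed memory domains:"
    # Pattern copied from the log: "closing md <name> because it has no
    # selected transport resources". Print only the md name.
    sed -n 's/.*closing md \([^ ]*\) because it has no selected transport resources.*/  \1/p' "$log_file"

    echo "Allocation failures:"
    # grep -c prints the number of matching lines (0 if none); "|| true"
    # keeps the function from returning non-zero when nothing matches.
    grep -c 'could not allocate user memory' "$log_file" || true
}

# Example (assumes the server output was saved to server.log):
#   summarize_ucx_log server.log
```

For the server log in this report, the first list would include cuda_cpy alongside self, tcp, sysv, posix, cma, and mlx5_1 through mlx5_3, and the failure count would be 3.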

client log (from a separate run, using the command shown on the next line):

UCX_TLS=rc ucx_perftest 127.0.0.1 -t tag_lat -c 1 -s 809600000 -m cuda
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |        latency (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1750906436.727419] [bddwd-inf-k8s-a100-ab2-0014:160066:0]         libperf.c:2064 UCX  WARN  ucp test failed to allocate memory
[1750906436.730516] [bddwd-inf-k8s-a100-ab2-0014:160066:0]    perftest_run.c:345  UCX  ERROR Failed to run test: Out of memory
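
One experiment that may help narrow this down, offered as an assumption and not something this report has verified: the server log shows the cuda_cpy memory domain being closed because UCX_TLS did not select it, before the CUDA allocation fails. The sketch below adds cuda_copy to the transport list and prints the resulting server and client command lines. tls_with is a hypothetical helper, and whether this change resolves the failure is untested here.

```shell
#!/bin/sh
# Append a transport to a comma-separated UCX_TLS list unless already present.
# tls_with is a hypothetical helper for illustration only.
tls_with() {
    case ",$1," in
        *",$2,"*) printf '%s\n' "$1" ;;        # already listed: keep as-is
        *)        printf '%s,%s\n' "$1" "$2" ;; # not listed: append it
    esac
}

# Start from the transport list used in this report and add cuda_copy.
tls=$(tls_with "rc,cuda_ipc" "cuda_copy")

# Print the command lines to try; they are not executed here.
echo "server: UCX_LOG_LEVEL=debug UCX_TLS=$tls ucx_perftest -c 0"
echo "client: UCX_TLS=$tls ucx_perftest 127.0.0.1 -t tag_bw -c 1 -s 1024 -m cuda"
```

If the allocation still fails with cuda_copy selected, the closed cuda_cpy memory domain can be ruled out as the cause.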

Setup and versions

  • OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • ofed_info -s MLNX_OFED_LINUX-5.0-2.1.8.0:
    • ibv_devinfo -vv
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.28.1002
        node_guid:                      08c0:eb03:00dd:ee46
        sys_image_guid:                 08c0:eb03:00dd:ee46
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000010
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffffffffff000
        max_qp:                         262144
        max_qp_wr:                      32768
        device_cap_flags:               0xe97e1c36
                                        BAD_PKEY_CNTR
                                        BAD_QKEY_CNTR
                                        AUTO_PATH_MIG
                                        CHANGE_PHY_PORT
                                        PORT_ACTIVE_EVENT
                                        SYS_IMAGE_GUID
                                        RC_RNR_NAK_GEN
                                        MEM_WINDOW
                                        UD_IP_CSUM
                                        XRC
                                        MEM_MGT_EXTENSIONS
                                        MEM_WINDOW_TYPE_2B
                                        MANAGED_FLOW_STEERING
                                        Unknown flags: 0xC8480000
        max_sge:                        30
        max_sge_rd:                     30
        max_cq:                         16777216
        max_cqe:                        4194303
        max_mr:                         16777216
        max_pd:                         16777216
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                4194304
        max_qp_init_rd_atom:            16
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         16777216
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  2097152
        max_mcast_qp_attach:            240
        max_total_mcast_qp_attach:      503316480
        max_ah:                         2147483647
        max_fmr:                        0
        max_srq:                        8388608
        max_srq_wr:                     32767
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             16
        general_odp_caps:
                                        ODP_SUPPORT
                                        ODP_SUPPORT_IMPLICIT
        rc_odp_caps:
                                        SUPPORT_SEND
                                        SUPPORT_RECV
                                        SUPPORT_WRITE
                                        SUPPORT_READ
                                        SUPPORT_SRQ
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        SUPPORT_SEND
        xrc_odp_caps:
                                        SUPPORT_SEND
                                        SUPPORT_WRITE
                                        SUPPORT_READ
                                        SUPPORT_SRQ
        completion timestamp_mask:                      0x7fffffffffffffff
        hca_core_clock:                 78125kHZ
        device_cap_flags_ex:            0x30000051E97E1C36
                                        PCI_WRITE_END_PADDING
                                        Unknown flags: 0x3000004100000000
        tso_caps:
        max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        max_rndv_hdr_size:              64
        max_num_tags:                   127
        max_ops:                        32768
        max_sge:                        1
        flags:
                                        IBV_TM_CAP_RC
  • For GPU related issues:
    • GPU type: NVIDIA A100-SXM4-40GB (as reported by nvidia-smi below)
    • Cuda:
      • nvcc --version
        • nvcc: NVIDIA (R) Cuda compiler driver
          Copyright (c) 2005-2022 NVIDIA Corporation
          Built on Wed_Sep_21_10:33:58_PDT_2022
          Cuda compilation tools, release 11.8, V11.8.89
          Build cuda_11.8.r11.8/compiler.31833905_0
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
        • no output

Additional information (depending on the issue)

  • nvidia-smi nvlink --status
    GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-0c15878b-9b12-3744-97ed-e25adeac57c3)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
    GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-36ea1ca4-4d0d-e206-5795-ab3c326154ef)
    Link 0: 25 GB/s
    Link 1: 25 GB/s
    Link 2: 25 GB/s
    Link 3: 25 GB/s
    Link 4: 25 GB/s
    Link 5: 25 GB/s
    Link 6: 25 GB/s
    Link 7: 25 GB/s
    Link 8: 25 GB/s
    Link 9: 25 GB/s
    Link 10: 25 GB/s
    Link 11: 25 GB/s
  • p2pBandwidthLatencyTest
    • [P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
      Device: 0, NVIDIA A100-SXM4-40GB, pciBusID: 17, pciDeviceID: 0, pciDomainID:0
      Device: 1, NVIDIA A100-SXM4-40GB, pciBusID: 1a, pciDeviceID: 0, pciDomainID:0
      Device=0 CAN Access Peer Device=1
      Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
D\D        0        1
  0        1        1
  1        1        1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D        0        1
  0  1212.18     9.58
  1     8.30  1302.08
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D        0        1
  0  1287.07   272.76
  1   276.54  1298.84
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D        0        1
  0  1308.63    10.13
  1    10.07   625.25
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D        0        1
  0  1299.92   326.65
  1   472.64   627.51
P2P=Disabled Latency Matrix (us)
GPU        0        1
  0     2.73    27.77
  1    28.56     2.21

CPU        0        1
  0     3.23     9.78
  1    11.16     3.16
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU        0        1
  0     2.72     3.46
  1     3.54     2.21

CPU        0        1
  0     3.10     2.58
  1     2.63     4.68
