Description
Describe the bug
I am trying to use RDMA over the srd transport, but UCX_PROTO_INFO=y shows that UCX falls back to "software emulation" instead of "zero-copy" when it is used through NIXL.
See below:
+--------------------------------+-------------------------------------------------------------+
| ucp_context_0 inter-node cfg#3 | remote memory write by ucp_put* from host memory to cuda |
+--------------------------------+------------------------------------------+------------------+
| 0..inf | software emulation | srd/rdmap135s0:1 |
+--------------------------------+------------------------------------------+------------------+
However, when I run ucx_perftest inside the same Pod, it shows zero-copy:
+---------------------------+---------------------------------------------------------------------------------------------------------------+
| perftest inter-node cfg#2 | remote memory write by ucp_put* from host memory to cuda |
+---------------------------+-----------+---------------------------------------------------------------------------------------------------+
| 0..1651 | copy-in | srd/rdmap85s0:1 |
| 1652..inf | zero-copy | 25% on srd/rdmap85s0:1, 25% on srd/rdmap86s0:1, 25% on srd/rdmap87s0:1 and 25% on srd/rdmap88s0:1 |
+---------------------------+-----------+---------------------------------------------------------------------------------------------------+
What is the difference between the two configurations that lets ucx_perftest use SRD with RDMA (zero-copy), while NIXL with the UCX backend falls back to software emulation?
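For reference, a minimal sketch of the kind of ucx_perftest invocation used for the comparison above, run with the same Pod environment (the test name, message size and server address below are illustrative, not the exact command):

# server side
UCX_PROTO_INFO=y ucx_perftest -t ucp_put_bw -m host,cuda -s 1048576
# client side: ucp_put from host memory to cuda memory on the server
UCX_PROTO_INFO=y ucx_perftest -t ucp_put_bw -m host,cuda -s 1048576 <server-ip>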
Steps to Reproduce
- Command line
- UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
  - Using commit: 7ec95b95e524a87e81cac92f5ca8523e3966b16b
- Any UCX environment variables used (equivalent shell exports are sketched after this list)
  - name: UCX_RNDV_THRESH
    value: "inf"
  - name: UCX_MAX_COMPONENT_MDS
    value: "32"
  - name: UCX_MAX_RMA_LANES
    value: "4"
  - name: UCX_PROTO_INFO
    value: "y"
  - name: UCX_RNDV_SCHEME
    value: "put_zcopy"
Setup and versions
- OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
  cat /etc/issue or cat /etc/redhat-release + uname -a
  -> Running inside Kubernetes on Ubuntu 24.04
- For Nvidia Bluefield SmartNIC include
  cat /etc/mlnx-release (the string identifies software and firmware setup)
- For RDMA/IB/RoCE related issues:
  - Driver version:
    rpm -q rdma-core or rpm -q libibverbs
    -> rdma-core = rdma-core/noble-updates,now 50.0-2ubuntu0.2 amd64
    -> libibverbs = libibverbs1/noble-updates,now 50.0-2ubuntu0.2 amd64 [installed]
  - or: MLNX_OFED version
    ofed_info -s
    -
  - HW information from ibstat or ibv_devinfo -vv command:
hca_id: rdmap137s0
transport: unspecified (4)
fw_ver: 0.0.0.0
node_guid: 398e:ae8a:0001:1400
sys_image_guid: 0000:0000:0000:0000
vendor_id: 0x1d0f
vendor_part_id: 61346
hw_ver: 0xEFA2
phys_port_cnt: 1
max_mr_size: 0x3000000000
page_size_cap: 0xfffff000
max_qp: 256
max_qp_wr: 4096
device_cap_flags: 0x00000000
max_sge: 2
max_sge_rd: 1
max_cq: 512
max_cqe: 32768
max_mr: 262144
max_pd: 256
max_qp_rd_atom: 0
max_ee_rd_atom: 0
max_res_rd_atom: 0
max_qp_init_rd_atom: 0
max_ee_init_rd_atom: 0
atomic_cap: ATOMIC_NONE (0)
max_ee: 0
max_rdd: 0
max_mw: 0
max_raw_ipv6_qp: 0
max_raw_ethy_qp: 0
max_mcast_grp: 0
max_mcast_qp_attach: 0
max_total_mcast_qp_attach: 0
max_ah: 1024
max_fmr: 0
max_srq: 0
max_pkeys: 1
local_ca_ack_delay: 0
general_odp_caps:
rc_odp_caps:
NO SUPPORT
uc_odp_caps:
NO SUPPORT
ud_odp_caps:
NO SUPPORT
xrc_odp_caps:
NO SUPPORT
completion_timestamp_mask not supported
core clock not supported
device_cap_flags_ex: 0x0
tso_caps:
max_tso: 0
rss_caps:
max_rwq_indirection_tables: 0
max_rwq_indirection_table_size: 0
rx_hash_function: 0x0
rx_hash_fields_mask: 0x0
max_wq_type_rq: 0
packet_pacing_caps:
qp_rate_limit_min: 0kbps
qp_rate_limit_max: 0kbps
tag matching not supported
num_comp_vectors: 32
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 0
port_lid: 0
port_lmc: 0x01
link_layer: Unspecified
max_msg_sz: 0x22e0
port_cap_flags: 0x00000000
port_cap_flags2: 0x0000
max_vl_num: 1 (1)
bad_pkey_cntr: 0x0
qkey_viol_cntr: 0x0
sm_sl: 0
pkey_tbl_len: 1
gid_tbl_len: 1
subnet_timeout: 0
init_type_reply: 0
active_width: 4X (2)
active_speed: 50.0 Gbps (64)
GID[ 0]: fe80:0000:0000:0000:0491:8aff:feae:8e39
- For GPU related issues:
  - GPU type
  - CUDA:
    - Driver version
    - Check if peer-direct is loaded:
      lsmod | grep nv_peer_mem and/or gdrcopy: lsmod | grep gdrdrv
Additional information (depending on the issue)
- OpenMPI version
- Output of ucx_info -d to show transports and devices recognized by UCX:
# Transport: srd
# Device: rdmap162s0:1
# Type: network
# System device: rdmap162s0 (22)
#
# capabilities:
# bandwidth: 23571.39/ppn + 0.00 MB/sec
# latency: 620 nsec
# overhead: 75 nsec
# put_bcopy: <= 4K
# put_zcopy: <= 1G, up to 1 iov
# put_opt_zcopy_align: <= 512
# put_align_mtu: <= 4K
# get_bcopy: <= 4K
# get_zcopy: <= 1G, up to 1 iov
# get_opt_zcopy_align: <= 512
# get_align_mtu: <= 4K
# am_short: <= 21
# am_bcopy: <= 4085
# am_zcopy: <= 4085, up to 1 iov
# am_opt_zcopy_align: <= 512
# am_align_mtu: <= 4K
# am header: <= 4085
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 11 bytes
# iface address: 3 bytes
# error handling: peer failure
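In case it helps narrow this down, the memory-domain section of the same ucx_info -d output may show whether the EFA memory domain can register cuda buffers in the NIXL process environment; a sketch for pulling those lines out (the exact field names are an assumption and vary by UCX version):

ucx_info -d | grep -iE "memory domain|component|register|memory types"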
- Configure result - config.log
- Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
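A sketch for producing the requested data-level log from the commit above (the launched binary and paths are placeholders):

./autogen.sh
./contrib/configure-release --enable-logging --prefix=$HOME/ucx-install
make -j && make install
UCX_LOG_LEVEL=data <nixl-application> > ucx_data.log 2>&1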