Software emulation being used for remote memory write instead of zero-copy #10950

@andjsmi

Describe the bug

I am trying to use RDMA over the srd transport, but when running through NIXL, UCX_PROTO_INFO=y reports "software emulation" instead of "zero-copy" for remote memory writes.

See below:

+--------------------------------+-------------------------------------------------------------+
| ucp_context_0 inter-node cfg#3 | remote memory write by ucp_put* from host memory to cuda    |
+--------------------------------+------------------------------------------+------------------+
|                         0..inf | software emulation                       | srd/rdmap135s0:1 |
+--------------------------------+------------------------------------------+------------------+

However, when I run ucx_perftest inside the same Pod, it shows zero-copy:

+---------------------------+---------------------------------------------------------------------------------------------------------------+
| perftest inter-node cfg#2 | remote memory write by ucp_put* from host memory to cuda                                                      |
+---------------------------+-----------+---------------------------------------------------------------------------------------------------+
|                   0..1651 | copy-in   | srd/rdmap85s0:1                                                                                   |
|                 1652..inf | zero-copy | 25% on srd/rdmap85s0:1, 25% on srd/rdmap86s0:1, 25% on srd/rdmap87s0:1 and 25% on srd/rdmap88s0:1 |
+---------------------------+-----------+---------------------------------------------------------------------------------------------------+

What is the difference between the two configurations that makes ucx_perftest use SRD with RDMA (zero-copy) while NIXL with the UCX backend falls back to software emulation?
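To compare the two UCX_PROTO_INFO dumps mechanically rather than by eye, each table can be reduced to (size range, protocol, lanes) rows. The following is a minimal sketch that assumes the pipe-delimited layout shown above; `parse_proto_info` and the two sample constants are names made up for illustration and are not part of UCX or NIXL.

```python
def parse_proto_info(text):
    """Parse one UCX_PROTO_INFO table.

    Returns (title, rows): title is the 2-cell header as a tuple,
    rows is a list of (size_range, protocol, lanes) tuples.
    Assumes the pipe-delimited layout pasted in this issue.
    """
    title = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ border lines and blank lines
        cells = tuple(c.strip() for c in line.strip("|").split("|"))
        if len(cells) == 2:
            title = cells          # header: (config name, operation)
        elif len(cells) == 3:
            rows.append(cells)     # body: (range, protocol, lanes)
    return title, rows


# Sample input copied from the two dumps in this issue.
NIXL_TABLE = """\
+--------------------------------+-------------------------------------------------------------+
| ucp_context_0 inter-node cfg#3 | remote memory write by ucp_put* from host memory to cuda    |
+--------------------------------+------------------------------------------+------------------+
|                         0..inf | software emulation                       | srd/rdmap135s0:1 |
+--------------------------------+------------------------------------------+------------------+
"""

PERFTEST_TABLE = """\
+---------------------------+---------------------------------------------------------------------------------------------------------------+
| perftest inter-node cfg#2 | remote memory write by ucp_put* from host memory to cuda                                                      |
+---------------------------+-----------+---------------------------------------------------------------------------------------------------+
|                   0..1651 | copy-in   | srd/rdmap85s0:1                                                                                   |
|                 1652..inf | zero-copy | 25% on srd/rdmap85s0:1, 25% on srd/rdmap86s0:1, 25% on srd/rdmap87s0:1 and 25% on srd/rdmap88s0:1 |
+---------------------------+-----------+---------------------------------------------------------------------------------------------------+
"""

if __name__ == "__main__":
    for name, table in (("nixl", NIXL_TABLE), ("perftest", PERFTEST_TABLE)):
        title, rows = parse_proto_info(table)
        print(f"{name}: {title[1]}")
        for size_range, protocol, lanes in rows:
            print(f"  {size_range:>10}  {protocol:<18}  {lanes}")
```

Both tables describe the same operation (ucp_put* from host memory to cuda), so the rows can be compared directly: the NIXL context selects a single lane with software emulation for all sizes, while perftest splits large messages across four srd lanes with zero-copy.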

Steps to Reproduce

  • Command line

  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
    -> Using commit: 7ec95b95e524a87e81cac92f5ca8523e3966b16b

  • Any UCX environment variables used

    - name: UCX_RNDV_THRESH
      value: "inf"
    - name: UCX_MAX_COMPONENT_MDS
      value: "32"
    - name: UCX_MAX_RMA_LANES
      value: "4"
    - name: UCX_PROTO_INFO
      value: "y"
    - name: UCX_RNDV_SCHEME
      value: "put_zcopy"
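Since both programs run in the same Pod, one thing worth ruling out is that the two processes see different UCX_* settings. A rough sketch for checking this on Linux is below; it reads the initial environment of a running process from /proc/<pid>/environ and diffs the UCX_* entries. The helper names (`ucx_env`, `proc_environ`, `diff_env`) are hypothetical, and any configuration an application applies programmatically through the UCX config API would not appear in this comparison.

```python
import os
import sys


def ucx_env(environ=None):
    """Return only the UCX_* variables from an environment mapping, sorted by name."""
    env = os.environ if environ is None else environ
    return dict(sorted((k, v) for k, v in env.items() if k.startswith("UCX_")))


def proc_environ(pid):
    """Read the initial environment of a running process (Linux only).

    /proc/<pid>/environ holds NUL-separated NAME=value entries as they were
    at process start; later setenv() calls are not reflected.
    """
    with open(f"/proc/{pid}/environ", "rb") as f:
        raw = f.read().decode(errors="replace")
    return dict(item.split("=", 1) for item in raw.split("\0") if "=" in item)


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for variables that differ."""
    names = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in names if a.get(k) != b.get(k)}


if __name__ == "__main__":
    if len(sys.argv) == 3:
        # Usage: script.py <pid_of_nixl_process> <pid_of_ucx_perftest>
        first = ucx_env(proc_environ(int(sys.argv[1])))
        second = ucx_env(proc_environ(int(sys.argv[2])))
        for name, (x, y) in diff_env(first, second).items():
            print(f"{name}: {x!r} vs {y!r}")
    else:
        # No PIDs given: show the UCX_* variables of this process.
        for name, value in ucx_env().items():
            print(f"{name}={value}")
```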

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
      -> Running inside Kubernetes on Ubuntu 24.04
    • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rpm -q rdma-core or rpm -q libibverbs
        -> rdma-core = rdma-core/noble-updates,now 50.0-2ubuntu0.2 amd64
        -> libibverbs = libibverbs1/noble-updates,now 50.0-2ubuntu0.2 amd64 [installed]
      • or: MLNX_OFED version ofed_info -s
    • HW information from ibstat or ibv_devinfo -vv command

hca_id: rdmap137s0
        transport:                      unspecified (4)
        fw_ver:                         0.0.0.0
        node_guid:                      398e:ae8a:0001:1400
        sys_image_guid:                 0000:0000:0000:0000
        vendor_id:                      0x1d0f
        vendor_part_id:                 61346
        hw_ver:                         0xEFA2
        phys_port_cnt:                  1
        max_mr_size:                    0x3000000000
        page_size_cap:                  0xfffff000
        max_qp:                         256
        max_qp_wr:                      4096
        device_cap_flags:               0x00000000
        max_sge:                        2
        max_sge_rd:                     1
        max_cq:                         512
        max_cqe:                        32768
        max_mr:                         262144
        max_pd:                         256
        max_qp_rd_atom:                 0
        max_ee_rd_atom:                 0
        max_res_rd_atom:                0
        max_qp_init_rd_atom:            0
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_NONE (0)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  0
        max_mcast_qp_attach:            0
        max_total_mcast_qp_attach:      0
        max_ah:                         1024
        max_fmr:                        0
        max_srq:                        0
        max_pkeys:                      1
        local_ca_ack_delay:             0
        general_odp_caps:
        rc_odp_caps:
                                        NO SUPPORT
        uc_odp_caps:
                                        NO SUPPORT
        ud_odp_caps:
                                        NO SUPPORT
        xrc_odp_caps:
                                        NO SUPPORT
        completion_timestamp_mask not supported
        core clock not supported
        device_cap_flags_ex:            0x0
        tso_caps:
                max_tso:                        0
        rss_caps:
                max_rwq_indirection_tables:                     0
                max_rwq_indirection_table_size:                 0
                rx_hash_function:                               0x0
                rx_hash_fields_mask:                            0x0
        max_wq_type_rq:                 0
        packet_pacing_caps:
                qp_rate_limit_min:      0kbps
                qp_rate_limit_max:      0kbps
        tag matching not supported
        num_comp_vectors:               32
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x01
                        link_layer:             Unspecified
                        max_msg_sz:             0x22e0
                        port_cap_flags:         0x00000000
                        port_cap_flags2:        0x0000
                        max_vl_num:             1 (1)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           1
                        gid_tbl_len:            1
                        subnet_timeout:         0
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           50.0 Gbps (64)
                        GID[  0]:               fe80:0000:0000:0000:0491:8aff:feae:8e39
  • For GPU related issues:
    • GPU type
    • Cuda:
      • Drivers version
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv

Additional information (depending on the issue)

  • OpenMPI version
  • Output of ucx_info -d to show transports and devices recognized by UCX

#      Transport: srd
#         Device: rdmap162s0:1
#           Type: network
#  System device: rdmap162s0 (22)
#
#      capabilities:
#            bandwidth: 23571.39/ppn + 0.00 MB/sec
#              latency: 620 nsec
#             overhead: 75 nsec
#            put_bcopy: <= 4K
#            put_zcopy: <= 1G, up to 1 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 4K
#            get_zcopy: <= 1G, up to 1 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 21
#             am_bcopy: <= 4085
#             am_zcopy: <= 4085, up to 1 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 4085
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 11 bytes
#        iface address: 3 bytes
#       error handling: peer failure
  • Configure result - config.log
  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
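The ucx_info -d output above lists put_zcopy: <= 1G for the srd transport, so the transport itself advertises zero-copy puts. To check this across every device without reading each block by hand, the capability lines can be parsed into a dictionary. This is only a sketch that assumes the "#   name: value" layout shown above; `parse_caps` and `SAMPLE` are illustrative names, not part of UCX.

```python
import re

# Matches lines such as "#            put_zcopy: <= 1G, up to 1 iov".
# Section headers like "#      capabilities:" have no value and are skipped.
CAP_LINE = re.compile(r"^#\s+([A-Za-z_ ]+?):\s+(.*)$")


def parse_caps(text):
    """Map each 'name: value' line of one ucx_info -d device block to a dict entry."""
    caps = {}
    for line in text.splitlines():
        match = CAP_LINE.match(line)
        if match:
            caps[match.group(1).strip()] = match.group(2).strip()
    return caps


# Excerpt copied from the output in this issue.
SAMPLE = """\
#         Device: rdmap162s0:1
#           Type: network
#
#      capabilities:
#            put_bcopy: <= 4K
#            put_zcopy: <= 1G, up to 1 iov
#            get_zcopy: <= 1G, up to 1 iov
#             am_zcopy: <= 4085, up to 1 iov
"""

if __name__ == "__main__":
    caps = parse_caps(SAMPLE)
    print(caps["Device"], "put_zcopy:", caps.get("put_zcopy", "not advertised"))
```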
