prov/shm getting performance issue in system to cuda memory transfer #11407

@mlefebvre1

Hello there,

We're evaluating libfabric as an abstraction layer for inter-host and intra-host video frame memory transfers. While comparing libfabric against native library calls (e.g., cudaMemcpy), we observed unexpectedly high latency for intra-host transfers from system memory to CUDA memory. The reverse direction (CUDA to system memory) performs as expected.

Configuration Details

OS: Linux 6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC T x86_64 GNU/Linux
Version: libfabric v2.2.0
Provider: shm
Transfer method: Remote write with immediate data via fi_writemsg
Protocol: We're not using CMA or XPMEM, so I assume the provider is using SAR; at least that's what the call stack shows.
Completion handling: Tested both busy-wait polling (fi_cq_read) and blocking fi_cq_sread; no latency difference observed between the two approaches.
Roles: Process 1 registers system memory, process 2 registers CUDA memory. For system-to-CUDA transfers, process 1 is the initiator and process 2 is the target. For CUDA-to-system transfers, process 1 is the target and process 2 is the initiator.

Problem

CUDA-to-system memory transfers perform as expected and are comparable to native calls. However, system-to-CUDA memory transfers take roughly 3.6x to 6.3x longer than cudaMemcpy, which suggests an issue with this specific transfer direction when using the shm provider. Here are the results:

System-to-CUDA

TransferSize (bytes)  Latency (us)  CPU usage  Library
2547840               627           0.04       libfabric
2547840               173           0.01       cuda
22389120              5341          0.32       libfabric
22389120              850           0.05       cuda

CUDA-to-system

TransferSize (bytes)  Latency (us)  CPU usage  Library
2547840               184           0.014      libfabric
2547840               182           0.014      cuda
22389120              850           0.055      libfabric
22389120              834           0.056      cuda
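
As a quick sanity check on the two tables, here is a small Python sketch that derives the slowdown factor and the effective throughput from the reported latencies. It is not part of our benchmark; the function names are made up for illustration, and it assumes 1 GB = 10^9 bytes.

```python
"""Derive slowdown and effective throughput from the latency tables above."""

# (direction, transfer size in bytes, libfabric latency in us, cudaMemcpy latency in us)
# Values are copied verbatim from the two tables in this report.
MEASUREMENTS = [
    ("system-to-CUDA", 2547840, 627, 173),
    ("system-to-CUDA", 22389120, 5341, 850),
    ("CUDA-to-system", 2547840, 184, 182),
    ("CUDA-to-system", 22389120, 850, 834),
]


def throughput_gb_s(size_bytes, latency_us):
    """Effective throughput in GB/s (assumption: 1 GB = 10**9 bytes).

    Bytes per microsecond is numerically MB/s, so divide by 1000 for GB/s.
    """
    return size_bytes / latency_us / 1000.0


def slowdown(libfabric_us, native_us):
    """How many times longer the libfabric transfer takes than the native call."""
    return libfabric_us / native_us


def summarize(measurements=MEASUREMENTS):
    """Return one dict per table row pair with the derived figures."""
    rows = []
    for direction, size, fab_us, cuda_us in measurements:
        rows.append({
            "direction": direction,
            "size_bytes": size,
            "slowdown": slowdown(fab_us, cuda_us),
            "libfabric_gb_s": throughput_gb_s(size, fab_us),
            "cuda_gb_s": throughput_gb_s(size, cuda_us),
        })
    return rows


if __name__ == "__main__":
    for row in summarize():
        print(
            f'{row["direction"]:<15} {row["size_bytes"]:>9} B  '
            f'slowdown {row["slowdown"]:.2f}x  '
            f'libfabric {row["libfabric_gb_s"]:.2f} GB/s  '
            f'cuda {row["cuda_gb_s"]:.2f} GB/s'
        )
```

On these numbers, system-to-CUDA through libfabric stays at about 4 GB/s for both transfer sizes, while CUDA-to-system is within about 2% of the native call.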

We've also tried to reproduce this with the fabtests fi_rma_bw and fi_rma_pingpong, but the results are mixed: fi_rma_bw shows a large gap between the two configurations, while fi_rma_pingpong shows none. You might have an explanation for that.

fi_rma_bw:

$ fi_rma_bw -p shm -s test1 -S 2547840 -c sread -o writedata -D cuda
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    200     485m        0.13s   4017.53     634.18       0.00

$ fi_rma_bw -p shm -S 2547840 -c sread -o writedata  test1
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    200     485m        0.12s   4216.11     604.31       0.00

$ fi_rma_bw -p shm -s test1 -S 2547840 -c sread -o writedata
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    200     485m        0.03s  18179.38     140.15       0.01

$ fi_rma_bw -p shm -S 2547840 -c sread -o writedata  -D cuda test1
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    200     485m        0.03s  18180.68     140.14       0.01

fi_rma_pingpong:

$ fi_rma_pingpong -p shm -s test1 -S 2547840 -c sread -o writedata -D cuda
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    100     485m        0.07s   7093.49     359.18       0.00

$ fi_rma_pingpong -p shm -S 2547840 -c sread -o writedata test1
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    100     485m        0.07s   7093.49     359.18       0.00

$ fi_rma_pingpong -p shm -s test1 -S 2547840 -c sread -o writedata
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    100     485m        0.07s   7099.52     358.88       0.00

$ fi_rma_pingpong -p shm -S 2547840 -c sread -o writedata -D cuda test1
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
2.4m    100     485m        0.07s   7099.62     358.87       0.00
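
To make the "mixed results" concrete, the sketch below parses the fabtests result rows shown above and compares the two command pairs of each test. The helper names are hypothetical, and the grouping into "pair 1" (the first two commands of each test) and "pair 2" (the last two) is only my reading of the output order. It also checks that MB/sec times usec/xfer gives back the 2547840 bytes passed to -S, which holds if fabtests counts 1 MB as 10^6 bytes.

```python
"""Compare the fabtests result rows above across the two command pairs."""

EXPECTED_BYTES = 2547840  # value passed to -S in every command above

# Result rows copied verbatim from the output above, grouped by command pair.
RESULTS = {
    "fi_rma_bw": {
        "pair 1": [
            "2.4m    200     485m        0.13s   4017.53     634.18       0.00",
            "2.4m    200     485m        0.12s   4216.11     604.31       0.00",
        ],
        "pair 2": [
            "2.4m    200     485m        0.03s  18179.38     140.15       0.01",
            "2.4m    200     485m        0.03s  18180.68     140.14       0.01",
        ],
    },
    "fi_rma_pingpong": {
        "pair 1": [
            "2.4m    100     485m        0.07s   7093.49     359.18       0.00",
            "2.4m    100     485m        0.07s   7093.49     359.18       0.00",
        ],
        "pair 2": [
            "2.4m    100     485m        0.07s   7099.52     358.88       0.00",
            "2.4m    100     485m        0.07s   7099.62     358.87       0.00",
        ],
    },
}


def parse_row(row):
    """Split one whitespace-separated result row into the fields used here."""
    fields = row.split()
    return {
        "iters": int(fields[1]),
        "mb_per_sec": float(fields[4]),
        "usec_per_xfer": float(fields[5]),
    }


def implied_bytes(parsed):
    """MB/sec * usec/xfer equals bytes per transfer when 1 MB = 10**6 bytes."""
    return parsed["mb_per_sec"] * parsed["usec_per_xfer"]


def mean_usec(rows):
    """Average usec/xfer over the rows of one command pair."""
    parsed = [parse_row(r) for r in rows]
    return sum(p["usec_per_xfer"] for p in parsed) / len(parsed)


def pair_ratio(test_name):
    """Ratio of mean usec/xfer between pair 1 and pair 2 of one test."""
    pairs = RESULTS[test_name]
    return mean_usec(pairs["pair 1"]) / mean_usec(pairs["pair 2"])


if __name__ == "__main__":
    for name in RESULTS:
        print(f"{name}: pair 1 / pair 2 = {pair_ratio(name):.2f}")
```

With the rows above, fi_rma_bw shows roughly a 4.4x gap between the two pairs, while fi_rma_pingpong shows essentially none.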

fi_info dump:

    caps: [ FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_RMA_EVENT, FI_HMEM ]
    mode: [  ]
    addr_format: FI_ADDR_STR
    src_addrlen: 20
    dest_addrlen: 0
    src_addr: fi_ns://test1:10001
    dest_addr: (null)
    handle: (nil)
    fi_tx_attr:
        caps: [ FI_RMA, FI_READ, FI_WRITE, FI_HMEM ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_SAS ]
        inject_size: 0
        size: 1024
        iov_limit: 4
        rma_iov_limit: 4
        tclass: 0x0
    fi_rx_attr:
        caps: [ FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_EVENT, FI_SOURCE, FI_HMEM ]
        mode: [  ]
        op_flags: [  ]
        msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
        size: 1024
        iov_limit: 4
    fi_ep_attr:
        type: FI_EP_RDM
        protocol: FI_PROTO_SHM
        protocol_version: 1
        max_msg_size: 18446744073709551615
        msg_prefix_size: 0
        max_order_raw_size: 0
        max_order_war_size: 0
        max_order_waw_size: 0
        mem_tag_format: 0xaaaaaaaaaaaaaaaa
        tx_ctx_cnt: 1
        rx_ctx_cnt: 1
        auth_key_size: 0
    fi_domain_attr:
        domain: 0x0
        name: shm
        threading: FI_THREAD_SAFE
        progress: FI_PROGRESS_MANUAL
        resource_mgmt: FI_RM_ENABLED
        av_type: FI_AV_UNSPEC
        mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_HMEM ]
        mr_key_size: 8
        cq_data_size: 8
        cq_cnt: 1024
        ep_cnt: 256
        tx_ctx_cnt: 1024
        rx_ctx_cnt: 1024
        max_ep_tx_ctx: 1
        max_ep_rx_ctx: 1
        max_ep_stx_ctx: 0
        max_ep_srx_ctx: 0
        cntr_cnt: 0
        mr_iov_limit: 4
        caps: [ FI_LOCAL_COMM ]
        mode: [  ]
        auth_key_size: 0
        max_err_data: 0
        mr_cnt: 0
        tclass: 0x0
    fi_fabric_attr:
        name: shm
        prov_name: shm
        prov_version: 202.0
        api_version: 2.2
    nic: (nil)

Any ideas?
