Description
Hello there,
We're evaluating libfabric as an abstraction layer for inter-host and intra-host video frame memory transfers. While comparing libfabric against native library calls (e.g., cudaMemcpy), we've observed unexpectedly poor performance specifically for intra-host transfers from system memory to CUDA memory. The reverse direction (CUDA to system memory) performs as expected.
Configuration Details
OS: Linux 6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC T x86_64 GNU/Linux
Version: libfabric v2.2.0
Provider: shm
Transfer method: Remote write with immediate data via fi_writemsg
Protocol: We're not using CMA or XPMEM, so I assume the SAR protocol is used; at least that's what the call stack shows.
Completion handling: Tested both busy-wait polling (fi_cq_read) and blocking fi_cq_sread; no latency difference observed between the two approaches.
Setup: Process 1 registers system memory, process 2 registers CUDA memory. For system-to-CUDA transfers, process 1 is the initiator and process 2 is the target; for CUDA-to-system transfers, process 1 is the target and process 2 is the initiator (a simplified sketch of both sides is included below).
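For reference, this is roughly what the two sides look like in our test code (trimmed down; `register_cuda_buf`, `write_frame`, `frame_id`, the requested key and the device ordinal are simplified placeholders rather than the actual production code):

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

/* Target (process 2): register the CUDA destination buffer so the peer
 * can write into it. Key and device ordinal are placeholders. */
static int register_cuda_buf(struct fid_domain *domain, void *cuda_buf,
                             size_t len, struct fid_mr **mr)
{
    struct iovec iov = { .iov_base = cuda_buf, .iov_len = len };
    struct fi_mr_attr attr = {
        .mr_iov        = &iov,
        .iov_count     = 1,
        .access        = FI_REMOTE_WRITE,
        .requested_key = 0xC0DE,        /* placeholder key */
        .iface         = FI_HMEM_CUDA,  /* mr_mode advertises FI_MR_HMEM */
        .device.cuda   = 0,             /* CUDA device ordinal */
    };
    return fi_mr_regattr(domain, &attr, 0, mr);
}

/* Initiator (process 1): remote write with immediate data from the
 * system-memory buffer, then wait for the TX completion. */
static int write_frame(struct fid_ep *ep, struct fid_cq *txcq,
                       void *host_buf, void *host_desc, size_t len,
                       fi_addr_t peer, uint64_t remote_vaddr, uint64_t rkey,
                       uint64_t frame_id)
{
    struct iovec msg_iov = { .iov_base = host_buf, .iov_len = len };
    struct fi_rma_iov rma_iov = {
        .addr = remote_vaddr,   /* virtual address, since mr_mode has FI_MR_VIRT_ADDR */
        .len  = len,
        .key  = rkey,
    };
    struct fi_msg_rma msg = {
        .msg_iov       = &msg_iov,
        .desc          = &host_desc, /* may be NULL, FI_MR_LOCAL is not required */
        .iov_count     = 1,
        .addr          = peer,
        .rma_iov       = &rma_iov,
        .rma_iov_count = 1,
        .data          = frame_id,   /* immediate data, delivered to the target's RX CQ */
    };
    ssize_t ret = fi_writemsg(ep, &msg, FI_REMOTE_CQ_DATA | FI_COMPLETION);
    if (ret)
        return (int)ret;

    struct fi_cq_data_entry comp;
    do {
        /* busy-wait polling with fi_cq_read() gives the same numbers */
        ret = fi_cq_sread(txcq, &comp, 1, NULL, -1);
    } while (ret == -FI_EAGAIN);
    return ret < 0 ? (int)ret : 0;
}
```

The target simply waits for the matching FI_REMOTE_CQ_DATA completion on its RX CQ (FI_RMA_EVENT is enabled); that part is omitted here for brevity.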
Problem
CUDA-to-system memory transfers perform as expected and show results comparable to the native calls. However, system-to-CUDA memory transfers exhibit unexpected performance degradation, which suggests an issue specific to this transfer direction when using the shm provider. Here are the results:
System-to-CUDA
| Transfer Size (Bytes) | Latency (us) | CPU Usage | Library |
|---|---|---|---|
| 2547840 | 627 | 0.04 | libfabric |
| 2547840 | 173 | 0.01 | cuda |
| 22389120 | 5341 | 0.32 | libfabric |
| 22389120 | 850 | 0.05 | cuda |
CUDA-to-system
| Transfer Size (Bytes) | Latency (us) | CPU Usage | Library |
|---|---|---|---|
| 2547840 | 184 | 0.014 | libfabric |
| 2547840 | 182 | 0.014 | cuda |
| 22389120 | 850 | 0.055 | libfabric |
| 22389120 | 834 | 0.056 | cuda |
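To put the latencies in perspective: at the larger size this works out to roughly 22389120 bytes / 5341 us ≈ 4.2 GB/s through libfabric versus 22389120 bytes / 850 us ≈ 26 GB/s for cudaMemcpy in the system-to-CUDA direction, while both libraries reach about 26 GB/s in the CUDA-to-system direction.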
We've also attempted to reproduce the results with the fabtests fi_rma_bw and fi_rma_pingpong, but we get mixed results: fi_rma_bw is slow only when the server side uses -D cuda (~634 us vs ~140 us per transfer), while fi_rma_pingpong reports ~359 us in every combination. You might have an explanation for this.
fi_rma_bw:
$ fi_rma_bw -p shm -s test1 -S 2547840 -c sread -o writedata -D cuda
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.13s 4017.53 634.18 0.00
$ fi_rma_bw -p shm -S 2547840 -c sread -o writedata test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.12s 4216.11 604.31 0.00
$ fi_rma_bw -p shm -s test1 -S 2547840 -c sread -o writedata
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.03s 18179.38 140.15 0.01
$ fi_rma_bw -p shm -S 2547840 -c sread -o writedata -D cuda test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.03s 18180.68 140.14 0.01
fi_rma_pingpong:
$ fi_rma_pingpong -p shm -s test1 -S 2547840 -c sread -o writedata -D cuda
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7093.49 359.18 0.00
$ fi_rma_pingpong -p shm -S 2547840 -c sread -o writedata test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7093.49 359.18 0.00
$ fi_rma_pingpong -p shm -s test1 -S 2547840 -c sread -o writedata
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7099.52 358.88 0.00
$ fi_rma_pingpong -p shm -S 2547840 -c sread -o writedata -D cuda test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7099.62 358.87 0.00
fi_info dump:
caps: [ FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_RMA_EVENT, FI_HMEM ]
mode: [ ]
addr_format: FI_ADDR_STR
src_addrlen: 20
dest_addrlen: 0
src_addr: fi_ns://test1:10001
dest_addr: (null)
handle: (nil)
fi_tx_attr:
caps: [ FI_RMA, FI_READ, FI_WRITE, FI_HMEM ]
mode: [ ]
op_flags: [ ]
msg_order: [ FI_ORDER_SAS ]
inject_size: 0
size: 1024
iov_limit: 4
rma_iov_limit: 4
tclass: 0x0
fi_rx_attr:
caps: [ FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_EVENT, FI_SOURCE, FI_HMEM ]
mode: [ ]
op_flags: [ ]
msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
size: 1024
iov_limit: 4
fi_ep_attr:
type: FI_EP_RDM
protocol: FI_PROTO_SHM
protocol_version: 1
max_msg_size: 18446744073709551615
msg_prefix_size: 0
max_order_raw_size: 0
max_order_war_size: 0
max_order_waw_size: 0
mem_tag_format: 0xaaaaaaaaaaaaaaaa
tx_ctx_cnt: 1
rx_ctx_cnt: 1
auth_key_size: 0
fi_domain_attr:
domain: 0x0
name: shm
threading: FI_THREAD_SAFE
progress: FI_PROGRESS_MANUAL
resource_mgmt: FI_RM_ENABLED
av_type: FI_AV_UNSPEC
mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_HMEM ]
mr_key_size: 8
cq_data_size: 8
cq_cnt: 1024
ep_cnt: 256
tx_ctx_cnt: 1024
rx_ctx_cnt: 1024
max_ep_tx_ctx: 1
max_ep_rx_ctx: 1
max_ep_stx_ctx: 0
max_ep_srx_ctx: 0
cntr_cnt: 0
mr_iov_limit: 4
caps: [ FI_LOCAL_COMM ]
mode: [ ]
auth_key_size: 0
max_err_data: 0
mr_cnt: 0
tclass: 0x0
fi_fabric_attr:
name: shm
prov_name: shm
prov_version: 202.0
api_version: 2.2
nic: (nil)
Any ideas?