Description
Hello there,
We're evaluating libfabric as an abstraction layer for inter-host and intra-host video frame memory transfers. While comparing libfabric against native library calls (e.g., cudaMemcpy), we've observed unexpectedly poor performance specifically for intra-host transfers from system memory to CUDA memory. The reverse direction (CUDA to system memory) performs as expected.
Configuration Details
OS: Linux 6.8.0-79-generic #79-Ubuntu SMP PREEMPT_DYNAMIC T x86_64 GNU/Linux
Version: libfabric v2.2.0
Provider: shm
Transfer method: Remote write with immediate data via fi_writemsg
Protocol: We're not using CMA or XPMEM, so I assume the SAR protocol is used; at least that's what the call stack shows.
Completion handling: Tested both busy-wait polling (fi_cq_read) and blocking fi_cq_sread; no latency difference observed between the two approaches.
Setup: Process 1 registers system memory, process 2 registers CUDA memory. For system-to-CUDA transfers, process 1 is the initiator and process 2 is the target; for CUDA-to-system transfers, process 1 is the target and process 2 is the initiator (a simplified sketch of both sides is included below).
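For reference, this is roughly what the two sides look like in our test code (trimmed down; `register_cuda_buf`, `write_frame`, `frame_id`, the requested key and the device ordinal are simplified placeholders rather than the actual production code):

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_rma.h>

/* Target (process 2): register the CUDA destination buffer so the peer
 * can write into it. Key and device ordinal are placeholders. */
static int register_cuda_buf(struct fid_domain *domain, void *cuda_buf,
                             size_t len, struct fid_mr **mr)
{
    struct iovec iov = { .iov_base = cuda_buf, .iov_len = len };
    struct fi_mr_attr attr = {
        .mr_iov        = &iov,
        .iov_count     = 1,
        .access        = FI_REMOTE_WRITE,
        .requested_key = 0xC0DE,        /* placeholder key */
        .iface         = FI_HMEM_CUDA,  /* mr_mode advertises FI_MR_HMEM */
        .device.cuda   = 0,             /* CUDA device ordinal */
    };
    return fi_mr_regattr(domain, &attr, 0, mr);
}

/* Initiator (process 1): remote write with immediate data from the
 * system-memory buffer, then wait for the TX completion. */
static int write_frame(struct fid_ep *ep, struct fid_cq *txcq,
                       void *host_buf, void *host_desc, size_t len,
                       fi_addr_t peer, uint64_t remote_vaddr, uint64_t rkey,
                       uint64_t frame_id)
{
    struct iovec msg_iov = { .iov_base = host_buf, .iov_len = len };
    struct fi_rma_iov rma_iov = {
        .addr = remote_vaddr,   /* virtual address, since mr_mode has FI_MR_VIRT_ADDR */
        .len  = len,
        .key  = rkey,
    };
    struct fi_msg_rma msg = {
        .msg_iov       = &msg_iov,
        .desc          = &host_desc, /* may be NULL, FI_MR_LOCAL is not required */
        .iov_count     = 1,
        .addr          = peer,
        .rma_iov       = &rma_iov,
        .rma_iov_count = 1,
        .data          = frame_id,   /* immediate data, delivered to the target's RX CQ */
    };
    ssize_t ret = fi_writemsg(ep, &msg, FI_REMOTE_CQ_DATA | FI_COMPLETION);
    if (ret)
        return (int)ret;

    struct fi_cq_data_entry comp;
    do {
        /* busy-wait polling with fi_cq_read() gives the same numbers */
        ret = fi_cq_sread(txcq, &comp, 1, NULL, -1);
    } while (ret == -FI_EAGAIN);
    return ret < 0 ? (int)ret : 0;
}
```

The target simply waits for the matching FI_REMOTE_CQ_DATA completion on its RX CQ (FI_RMA_EVENT is enabled); that part is omitted here for brevity.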
Problem
CUDA-to-system memory transfers perform as expected and show results comparable to the native calls. However, system-to-CUDA memory transfers exhibit unexpected performance degradation, which suggests an issue specific to this transfer direction when using the shm provider. Here are the results:
System-to-CUDA
| Transfer Size (Bytes) | Latency (us) | CPU Usage | Library |
|---|---|---|---|
| 2547840 | 627 | 0.04 | libfabric |
| 2547840 | 173 | 0.01 | cuda |
| 22389120 | 5341 | 0.32 | libfabric |
| 22389120 | 850 | 0.05 | cuda |
CUDA-to-system
| Transfer Size (Bytes) | Latency (us) | CPU Usage | Library |
|---|---|---|---|
| 2547840 | 184 | 0.014 | libfabric |
| 2547840 | 182 | 0.014 | cuda |
| 22389120 | 850 | 0.055 | libfabric |
| 22389120 | 834 | 0.056 | cuda |
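To put the latencies in perspective: at the larger size this works out to roughly 22389120 bytes / 5341 us ≈ 4.2 GB/s through libfabric versus 22389120 bytes / 850 us ≈ 26 GB/s for cudaMemcpy in the system-to-CUDA direction, while both libraries reach about 26 GB/s in the CUDA-to-system direction.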
We've also attempted to reproduce the results with the fabtests fi_rma_bw and fi_rma_pingpong, but we get mixed results: fi_rma_bw is slow only when the server side uses -D cuda (~634 us vs ~140 us per transfer), while fi_rma_pingpong reports ~359 us in every combination. You might have an explanation for this.
fi_rma_bw:
$ fi_rma_bw -p shm -s test1 -S 2547840 -c sread -o writedata -D cuda
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.13s 4017.53 634.18 0.00
$ fi_rma_bw -p shm -S 2547840 -c sread -o writedata test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.12s 4216.11 604.31 0.00
$ fi_rma_bw -p shm -s test1 -S 2547840 -c sread -o writedata
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.03s 18179.38 140.15 0.01
$ fi_rma_bw -p shm -S 2547840 -c sread -o writedata -D cuda test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 200 485m 0.03s 18180.68 140.14 0.01
fi_rma_pingpong:
$ fi_rma_pingpong -p shm -s test1 -S 2547840 -c sread -o writedata -D cuda
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7093.49 359.18 0.00
$ fi_rma_pingpong -p shm -S 2547840 -c sread -o writedata test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7093.49 359.18 0.00
$ fi_rma_pingpong -p shm -s test1 -S 2547840 -c sread -o writedata
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7099.52 358.88 0.00
$ fi_rma_pingpong -p shm -S 2547840 -c sread -o writedata -D cuda test1
bytes iters total time MB/sec usec/xfer Mxfers/sec
2.4m 100 485m 0.07s 7099.62 358.87 0.00
fi_info dump:
caps: [ FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_LOCAL_COMM, FI_RMA_EVENT, FI_HMEM ]
mode: [ ]
addr_format: FI_ADDR_STR
src_addrlen: 20
dest_addrlen: 0
src_addr: fi_ns://test1:10001
dest_addr: (null)
handle: (nil)
fi_tx_attr:
caps: [ FI_RMA, FI_READ, FI_WRITE, FI_HMEM ]
mode: [ ]
op_flags: [ ]
msg_order: [ FI_ORDER_SAS ]
inject_size: 0
size: 1024
iov_limit: 4
rma_iov_limit: 4
tclass: 0x0
fi_rx_attr:
caps: [ FI_RMA, FI_REMOTE_WRITE, FI_MULTI_RECV, FI_RMA_EVENT, FI_SOURCE, FI_HMEM ]
mode: [ ]
op_flags: [ ]
msg_order: [ FI_ORDER_RAR, FI_ORDER_RAW, FI_ORDER_RAS, FI_ORDER_WAR, FI_ORDER_WAW, FI_ORDER_WAS, FI_ORDER_SAR, FI_ORDER_SAW, FI_ORDER_SAS, FI_ORDER_RMA_RAR, FI_ORDER_RMA_RAW, FI_ORDER_RMA_WAR, FI_ORDER_RMA_WAW, FI_ORDER_ATOMIC_RAR, FI_ORDER_ATOMIC_RAW, FI_ORDER_ATOMIC_WAR, FI_ORDER_ATOMIC_WAW ]
size: 1024
iov_limit: 4
fi_ep_attr:
type: FI_EP_RDM
protocol: FI_PROTO_SHM
protocol_version: 1
max_msg_size: 18446744073709551615
msg_prefix_size: 0
max_order_raw_size: 0
max_order_war_size: 0
max_order_waw_size: 0
mem_tag_format: 0xaaaaaaaaaaaaaaaa
tx_ctx_cnt: 1
rx_ctx_cnt: 1
auth_key_size: 0
fi_domain_attr:
domain: 0x0
name: shm
threading: FI_THREAD_SAFE
progress: FI_PROGRESS_MANUAL
resource_mgmt: FI_RM_ENABLED
av_type: FI_AV_UNSPEC
mr_mode: [ FI_MR_VIRT_ADDR, FI_MR_HMEM ]
mr_key_size: 8
cq_data_size: 8
cq_cnt: 1024
ep_cnt: 256
tx_ctx_cnt: 1024
rx_ctx_cnt: 1024
max_ep_tx_ctx: 1
max_ep_rx_ctx: 1
max_ep_stx_ctx: 0
max_ep_srx_ctx: 0
cntr_cnt: 0
mr_iov_limit: 4
caps: [ FI_LOCAL_COMM ]
mode: [ ]
auth_key_size: 0
max_err_data: 0
mr_cnt: 0
tclass: 0x0
fi_fabric_attr:
name: shm
prov_name: shm
prov_version: 202.0
api_version: 2.2
nic: (nil)
Any ideas?