
Conversation

@a-szegel (Contributor)

Making this PR a Draft PR because I have not finished testing yet.

a-szegel and others added 4 commits October 16, 2025 23:37
CUDA 11.3 introduced the cuFlushGPUDirectRDMAWrites API, which takes the place
of the host transport enforce_cst API. NVSHMEM no longer supports CUDA 11, so
these legacy APIs can be removed.

Signed-off-by: Seth Zegelstein <[email protected]>
The existing is_proxy variable is equivalent to qp_index. Rename it
everywhere for consistency.

Signed-off-by: Seth Zegelstein <[email protected]>
Attempt to request FI_PROGRESS_AUTO to see whether the libfabric provider
supports it; if it does not, fall back to FI_PROGRESS_MANUAL.
FI_PROGRESS_AUTO means that we do not need to call into the progress
engine for submitted operations to complete. This means that we can
remove the host endpoint from the progress call, and we only need to
progress the host endpoint when the user calls nvshmem_quiet() from the
host. This allows us to set the threading model to FI_THREAD_COMPLETION
because the host only progresses the host EP and the proxy only
progresses the proxy EP, which complies with FI_THREAD_COMPLETION.
An edge case exists where the user calls nvshmem_quiet() on the host
QP_IDX from a GPU kernel, but this is illegal because the user should
not call QP APIs on QPs that were not provided to them via the QP
creation APIs.

This patch should offer a performance improvement because it reduces the
number of EPs that are progressed in the critical path, and it allows
the libfabric provider to reduce locking because of the
FI_THREAD_COMPLETION threading model.

Signed-off-by: Seth Zegelstein <[email protected]>
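
For reference, a minimal sketch (not the code in this PR) of how a transport might request FI_PROGRESS_AUTO and FI_THREAD_COMPLETION through fi_getinfo() hints and fall back to FI_PROGRESS_MANUAL when the provider rejects auto progress:

```c
/* Minimal sketch, not the PR's actual code: request auto progress and
 * completion-level threading, falling back to manual progress if no
 * provider matches the hints. */
#include <stdio.h>
#include <rdma/fabric.h>

static struct fi_info *get_provider_info(void)
{
    struct fi_info *hints, *info = NULL;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return NULL;

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_RMA | FI_ATOMIC;
    /* Each thread only touches completions of its own EP (host vs. proxy). */
    hints->domain_attr->threading = FI_THREAD_COMPLETION;
    /* First try auto progress so the host EP need not be progressed. */
    hints->domain_attr->data_progress = FI_PROGRESS_AUTO;

    ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
    if (ret) {
        /* Provider does not support auto progress; fall back to manual. */
        hints->domain_attr->data_progress = FI_PROGRESS_MANUAL;
        ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
        if (ret)
            fprintf(stderr, "fi_getinfo failed: %s\n", fi_strerror(-ret));
    }

    fi_freeinfo(hints);
    return ret ? NULL : info;
}
```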
This change implements multi-rail support for the libfabric host proxy
transport. The transport changes from having one domain with two EPs to
having one host domain on NIC 1 and one proxy domain per NIC. Splitting
the host EP and proxy EP into separate domains was done to keep the code
simple. Every domain resource (including the AV) was bound on a 1:1
basis per EP, so this change should be a functional no-op. In the future,
when the QP API is implemented on the libfabric host proxy transport, this
can easily be extended to N EPs per domain. This code uses a round-robin
load balancer to assign messages to NICs. One NIC is used for the entire
operation call into the libfabric transport (including put-signal), but
not for messages that are segmented due to size or MR boundaries.
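
A minimal sketch of the round-robin NIC selection described above; the struct and function names here are illustrative assumptions, not identifiers from this PR:

```c
/* Hypothetical sketch of round-robin NIC assignment; transport_state_t,
 * next_nic, and select_nic are illustrative names only. */
#include <stdint.h>

typedef struct {
    int num_nics;      /* number of proxy domains (one per NIC) */
    uint32_t next_nic; /* round-robin cursor */
} transport_state_t;

/* Pick a NIC for one operation call into the transport (e.g. a put-signal).
 * Per the commit message, messages segmented by size or MR boundaries are
 * not guaranteed to stay on the NIC chosen here. */
static int select_nic(transport_state_t *state)
{
    int nic = state->next_nic % state->num_nics;
    state->next_nic++;
    return nic;
}
```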

The number of NICs (domains) per PE is limited by the size of struct
nvshmemt_libfabric_mem_handle_t. A new environment variable,
NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE, controls the maximum number of NICs per PE.
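
An illustrative sketch of how the environment variable and the struct-size limit might interact; the LIBFABRIC_MAX_NICS_SUPPORTED constant and helper name are assumptions, not the actual values or code:

```c
/* Illustrative only: cap the NIC count by NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE
 * and by the compile-time limit implied by the mem-handle size. */
#include <stdlib.h>

#define LIBFABRIC_MAX_NICS_SUPPORTED 4 /* assumed value for illustration */

static int get_max_nics_per_pe(int nics_detected)
{
    int max_nics = LIBFABRIC_MAX_NICS_SUPPORTED;
    const char *env = getenv("NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE");

    if (env) {
        int requested = atoi(env);
        if (requested > 0 && requested < max_nics)
            max_nics = requested;
    }
    return nics_detected < max_nics ? nics_detected : max_nics;
}
```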

Thank you, Justin, for contributing an initial implementation of multi-rail,
which I built on top of.

Co-authored-by: Justin Chui <[email protected]>
Signed-off-by: Seth Zegelstein <[email protected]>
@a-szegel (Contributor, Author)

Was able to get all perftests running with this PR + #36 and #26

a-szegel marked this pull request as ready for review November 24, 2025 19:30