Skip to content

Conversation

@a-szegel
Copy link
Contributor

CUDA 11.3 released cuFlushGPUDirectRDMAWrites API which takes the place of the host transport enforce_cst api. NVSHMEM no longer supports CUDA 11, so these legacy API's can be removed.

This PR takes the place of #13

}
}

if ((peer_connectivity == NVSHMEMI_JOB_GPU_PROXY) && (enforce_cst)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The enforce_cst_at_target API itself is not required for the reasons listed in your PR description, but we want to know when CST on the target side is required. This is really only the case for ibrc which is using a write as the signal for barrier and during team creation.
See the IBRC transport
The nvshmemi_device_state.job_connectivity value is used to check for that requirement and call the corresponding transfer api which triggers the proxy thread call. If we get rid of the cst op in the transport, we may need a flag in the transport information we communicate back to the library.
We don't want to make an extra proxy call unconditionally for transports like libfabrics and UCX.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, restored enforce_cst_at_target, thanks!

}
return;
}
#if defined(NVSHMEM_PPC64LE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine to remove since Power support is deprecated/removed.

CUDA 11.3 released cuFlushGPUDirectRDMAWrites API which takes the place
of the host transport enforce_cst api. NVSHMEM no longer supports CUDA
11, so these legacy API's can be removed.

Signed-off-by: Seth Zegelstein <[email protected]>
@a-szegel
Copy link
Contributor Author

Also realized I could get rid of the local mr stuff in libfabric transport.

return status;
}

int nvshmemt_ibdevx_enforce_cst_at_target(struct nvshmem_transport *tcurr) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one we need to keep. It never goes through the proxy and the other code path.

Copy link
Contributor Author

@a-szegel a-szegel Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to keep this? It was only called at: transport->host_ops.enforce_cst = nvshmemt_ibdevx_enforce_cst_at_target;... i.e.. the enforce_cst proxy API that we are getting rid of. I think there are 1 of 2 possible errors.

  1. The function name was named incorrectly
  2. Forgot to assign transport->host_ops.enforce_cst_at_target

fence_handle fence;
quiet_handle quiet;
put_signal_handle put_signal;
int (*enforce_cst)(struct nvshmem_transport *transport);
Copy link
Collaborator

@seth-howell seth-howell Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I look at this, the more I feel like enforce_cst_at_target should be removed from the function table and turned into a flag (somewhere in the nvshmem_transport struct starting on line 187). We are breaking the host-lib to transport API this release anyway, so it would be nice to just rip the band-aid off all at once.
That method has the added bonus of reducing confusion for anyone implementing a custom transport plugin in the future. It will also allow you to remove the enforce_cst code from ibrc.cpp to match the rest of the transports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that the flag makes more sense than the function table. It will be a little bit of time before I can go back and make this change.

Copy link
Collaborator

@seth-howell seth-howell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple more comments.

@a-szegel a-szegel changed the title transport: Deprecate enforce_cst and enforce_cst_at_target transport: Deprecate enforce_cst Oct 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants