-
Notifications
You must be signed in to change notification settings - Fork 43
transport: Deprecate enforce_cst #14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: devel
Are you sure you want to change the base?
Conversation
| } | ||
| } | ||
|
|
||
| if ((peer_connectivity == NVSHMEMI_JOB_GPU_PROXY) && (enforce_cst)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The enforce_cst_at_target API itself is not required for the reasons listed in your PR description, but we want to know when CST on the target side is required. This is really only the case for ibrc which is using a write as the signal for barrier and during team creation.
See the IBRC transport
The nvshmemi_device_state.job_connectivity value is used to check for that requirement and call the corresponding transfer api which triggers the proxy thread call. If we get rid of the cst op in the transport, we may need a flag in the transport information we communicate back to the library.
We don't want to make an extra proxy call unconditionally for transports like libfabrics and UCX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, restored enforce_cst_at_target, thanks!
| } | ||
| return; | ||
| } | ||
| #if defined(NVSHMEM_PPC64LE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine to remove since Power support is deprecated/removed.
CUDA 11.3 released cuFlushGPUDirectRDMAWrites API which takes the place of the host transport enforce_cst api. NVSHMEM no longer supports CUDA 11, so these legacy API's can be removed. Signed-off-by: Seth Zegelstein <[email protected]>
f57cd17 to
462478c
Compare
|
Also realized I could get rid of the local mr stuff in libfabric transport. |
| return status; | ||
| } | ||
|
|
||
| int nvshmemt_ibdevx_enforce_cst_at_target(struct nvshmem_transport *tcurr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one we need to keep. It never goes through the proxy and the other code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to keep this? It was only called at: transport->host_ops.enforce_cst = nvshmemt_ibdevx_enforce_cst_at_target;... i.e.. the enforce_cst proxy API that we are getting rid of. I think there are 1 of 2 possible errors.
- The function name was named incorrectly
- Forgot to assign
transport->host_ops.enforce_cst_at_target
| fence_handle fence; | ||
| quiet_handle quiet; | ||
| put_signal_handle put_signal; | ||
| int (*enforce_cst)(struct nvshmem_transport *transport); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I look at this, the more I feel like enforce_cst_at_target should be removed from the function table and turned into a flag (somewhere in the nvshmem_transport struct starting on line 187). We are breaking the host-lib to transport API this release anyway, so it would be nice to just rip the band-aid off all at once.
That method has the added bonus of reducing confusion for anyone implementing a custom transport plugin in the future. It will also allow you to remove the enforce_cst code from ibrc.cpp to match the rest of the transports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree that the flag makes more sense than the function table. It will be a little bit of time before I can go back and make this change.
seth-howell
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple more comments.
CUDA 11.3 released cuFlushGPUDirectRDMAWrites API which takes the place of the host transport enforce_cst api. NVSHMEM no longer supports CUDA 11, so these legacy API's can be removed.
This PR takes the place of #13