Summary
When using MPIX Stream enqueue APIs with MPICH ch4:ucx in a PyTorch ProcessGroup backend, we observe correctness failures for larger allreduce sizes unless we force a cudaStreamSynchronize after each MPIX enqueue. This strongly suggests the “completion” seen by the application (via a CUDA event recorded after enqueue) can precede the actual GPU work’s visibility/completion on the provided CUDA stream. We need an event-compatible completion mechanism so frameworks can remain non-blocking and preserve overlap.
Scenario
Use a CUDA stream S to:

- enqueue an in-place allreduce via MPIX_Allreduce_enqueue(..., stream_comm), with stream_comm created from S
- record a CUDA event E on S immediately after the enqueue
- later rely on E (or an event-wait) for completion instead of cudaStreamSynchronize(S) (see the sketch below)
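
For concreteness, here is a minimal sketch of the pattern (error checking omitted; buf and count are placeholders for the application's GPU buffer, and the stream-communicator setup uses the MPIX_Stream info hints as we understand them):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Sketch only: buf/count stand in for the application's device buffer. */
void allreduce_on_stream(float *buf, int count)
{
    cudaStream_t S;
    cudaStreamCreate(&S);

    /* Bind CUDA stream S to an MPIX stream, then derive a stream communicator. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "cudaStream");
    MPIX_Info_set_hex(info, "value", &S, sizeof(S));

    MPIX_Stream mpi_stream;
    MPIX_Stream_create(info, &mpi_stream);
    MPI_Info_free(&info);

    MPI_Comm stream_comm;
    MPIX_Stream_comm_create(MPI_COMM_WORLD, mpi_stream, &stream_comm);

    /* Enqueue the in-place allreduce onto S through the stream communicator. */
    MPIX_Allreduce_enqueue(MPI_IN_PLACE, buf, count, MPI_FLOAT, MPI_SUM, stream_comm);

    /* Record an event on S immediately after the enqueue ... */
    cudaEvent_t E;
    cudaEventCreateWithFlags(&E, cudaEventDisableTiming);
    cudaEventRecord(E, S);

    /* ... and rely on E for completion.  Today this is where we are forced
       to call cudaStreamSynchronize(S) instead to get correct results for
       larger message sizes. */
    cudaEventSynchronize(E);
}
```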
Proposal
A CUDA event recorded after enqueue on the same stream should become “ready” only after the underlying GPU work for the collective is finished and data is visible to subsequent consumers on that stream. This allows non-blocking frameworks to:
- return immediately
- compose precise cross-stream dependencies via cudaStreamWaitEvent (illustrated below)
- avoid stream-wide synchronizations
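
As an illustration of the composition we want, the following sketch chains downstream GPU work off the completion event E with no host-side blocking (consumer_stream, host_out, and the memcpy are hypothetical stand-ins for whatever consumes the reduced data):

```c
#include <cuda_runtime.h>

/* E was recorded on stream S right after MPIX_Allreduce_enqueue (see above). */
void chain_consumer(cudaStream_t consumer_stream, cudaEvent_t E,
                    float *host_out, const float *buf, size_t bytes)
{
    /* Nothing enqueued on consumer_stream after this point runs until E is
       ready.  This is only correct if E becomes ready no earlier than the
       allreduce's GPU work finishing and its results being visible. */
    cudaStreamWaitEvent(consumer_stream, E, 0);

    /* Stand-in for downstream GPU work that reads the allreduce result. */
    cudaMemcpyAsync(host_out, buf, bytes, cudaMemcpyDeviceToHost, consumer_stream);
}
```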
Why this matters
PyTorch, NCCL, and other GPU-accelerated frameworks rely on CUDA events to coordinate completion without stalling streams or the CPU.
For process-group backends, we need the comm library to either:
- execute the GPU work on the provided CUDA stream and let us record a reliable end-event, or
- provide an API to record/obtain a completion event itself (one hypothetical shape is sketched below).
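
Purely as an illustration of the second option, and not an existing MPICH API, such an interface might look like:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* HYPOTHETICAL -- does not exist in MPICH today.  One possible shape for an
   enqueue variant that hands back a CUDA event which becomes ready only once
   the collective's GPU work has completed on the stream communicator's stream. */
int MPIX_Allreduce_enqueue_event(const void *sendbuf, void *recvbuf, int count,
                                 MPI_Datatype datatype, MPI_Op op, MPI_Comm stream_comm,
                                 cudaEvent_t *completion_event);
```

With either option, a ProcessGroup backend could wrap the event in its Work object and implement wait()/isCompleted() via cudaStreamWaitEvent/cudaEventQuery instead of blocking the stream or the host.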
Is this doable? Please let me know your thoughts!