
[MPIX Stream] CUDA event compatible completion semantics #7612

@R0n12

Summary

When using MPIX Stream enqueue APIs with MPICH ch4:ucx in a PyTorch ProcessGroup backend, we observe correctness failures for larger allreduce sizes unless we force a cudaStreamSynchronize after each MPIX enqueue. This strongly suggests the “completion” seen by the application (via a CUDA event recorded after enqueue) can precede the actual GPU work’s visibility/completion on the provided CUDA stream. We need an event-compatible completion mechanism so frameworks can remain non-blocking and preserve overlap.
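For reference, the blocking workaround looks roughly like this (a minimal sketch; `buf` and `count` are placeholders, and `stream_comm` is assumed to be a stream communicator created from S as in the Scenario below):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Current workaround: a full stream sync after every enqueue. This
 * restores correctness but blocks the host and destroys overlap. */
static void allreduce_blocking(float *buf, int count,
                               MPI_Comm stream_comm, cudaStream_t s)
{
    MPIX_Allreduce_enqueue(MPI_IN_PLACE, buf, count,
                           MPI_FLOAT, MPI_SUM, stream_comm);
    /* Without this, larger allreduces return incorrect data even though
     * a CUDA event recorded on S after the enqueue reports completion. */
    cudaStreamSynchronize(s);
}
```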

Scenario

Use a CUDA stream S to (see the sketch after this list):

  1. enqueue an in-place allreduce via MPIX_Allreduce_enqueue(..., stream_comm) (created from S)
  2. record a CUDA event E on S immediately after enqueue
  3. later rely on E (or a cudaStreamWaitEvent on E) for completion instead of cudaStreamSynchronize(S)
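
In code, the scenario is roughly the following (a sketch continuing from the includes above, and using the "type"/"value" info hints shown in MPICH's stream examples, if I have those right; `buf` and `count` are placeholders, teardown omitted):

```c
/* Wrap CUDA stream S in an MPIX stream and derive a stream communicator. */
static MPI_Comm make_stream_comm(cudaStream_t s)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "type", "cudaStream_t");
    MPIX_Info_set_hex(info, "value", &s, sizeof(s));

    MPIX_Stream stream;
    MPIX_Stream_create(info, &stream);
    MPI_Info_free(&info);

    MPI_Comm stream_comm;
    MPIX_Stream_comm_create(MPI_COMM_WORLD, stream, &stream_comm);
    return stream_comm;
}

/* Steps 1-3 of the scenario: enqueue, record E on S, return E. */
static cudaEvent_t allreduce_nonblocking(float *buf, int count,
                                         MPI_Comm stream_comm, cudaStream_t s)
{
    /* 1. enqueue the in-place allreduce on the stream communicator */
    MPIX_Allreduce_enqueue(MPI_IN_PLACE, buf, count,
                           MPI_FLOAT, MPI_SUM, stream_comm);

    /* 2. record event E on S immediately after the enqueue */
    cudaEvent_t e;
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
    cudaEventRecord(e, s);

    /* 3. the caller relies on E (cudaEventSynchronize or
     *    cudaStreamWaitEvent) instead of cudaStreamSynchronize(S);
     *    today this is where correctness breaks for larger messages */
    return e;
}
```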

Proposal

A CUDA event recorded after enqueue on the same stream should become “ready” only after the underlying GPU work for the collective is finished and data is visible to subsequent consumers on that stream. This allows non-blocking frameworks to:

  • return immediately
  • compose precise cross-stream dependencies via cudaStreamWaitEvent (see the sketch below)
  • avoid stream-wide synchronizations
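
With the proposed semantics, a consumer stream could depend on the result with no host-side blocking, for example (sketch; `e` is the event returned above, `consumer` is a second CUDA stream):

```c
/* Compose a device-side dependency: work later enqueued on `consumer`
 * runs only after E fires, i.e. after the allreduce result is visible. */
static void consume_elsewhere(cudaEvent_t e, cudaStream_t consumer,
                              const float *dev_buf, float *host_out,
                              size_t bytes)
{
    cudaStreamWaitEvent(consumer, e, 0);  /* no CPU stall, no stream-wide sync */
    cudaMemcpyAsync(host_out, dev_buf, bytes,
                    cudaMemcpyDeviceToHost, consumer);
}
```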

Why this matters

PyTorch and other GPU-accelerated frameworks, and comm libraries such as NCCL, rely on CUDA events and stream ordering to coordinate completion without stalling streams or the CPU.
For process-group backends, we need the comm library to either:

  • execute the GPU work on the provided CUDA stream and let us record a reliable end-event, or
  • provide an API to record/obtain a completion event itself (a hypothetical shape is sketched below).
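
To make the second option concrete, one entirely hypothetical API shape (no such function exists in MPICH today) could be:

```c
/* HYPOTHETICAL, for illustration only: an enqueue variant that also
 * returns a CUDA event guaranteed to fire only after the collective's
 * GPU work has completed and its result is visible on the stream. */
int MPIX_Allreduce_enqueue_event(const void *sendbuf, void *recvbuf,
                                 int count, MPI_Datatype datatype, MPI_Op op,
                                 MPI_Comm stream_comm,
                                 cudaEvent_t *completion_event);
```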

Is this doable? Please let me know your thoughts!
