UCT/CUDA: Add support for flush_ep#10991
Conversation
WalkthroughAdds a CUDA-specific endpoint flush API that enqueues per-stream flush events tied to a shared flush descriptor, introduces flush event types and a flush descriptor pool, updates progress to treat flush events specially, and adds rollback and cleanup on allocation/enqueue failures. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant EP as Endpoint
participant IFACE as CUDA_Base_IFACE
participant Stream as GPU_Stream
participant Progress as Progress_Loop
Caller->>EP: ep_flush(comp)
activate EP
EP->>IFACE: uct_cuda_base_ep_flush()
activate IFACE
IFACE->>IFACE: alloc flush_desc (stream_counter = N)
loop per active stream (N)
IFACE->>Stream: enqueue per-stream flush event (flush_stream_desc)
end
EP-->>Caller: return (pending)
deactivate EP
deactivate IFACE
par Stream completions & progress polling
Stream->>IFACE: stream event triggers callback
activate IFACE
IFACE->>IFACE: uct_cuda_base_stream_flushed_cb() — decrement counter
alt counter == 0
IFACE->>Caller: complete user completion, free flush_desc
end
deactivate IFACE
Progress->>Progress: poll events (skip flush events)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/uct/cuda/base/cuda_iface.c(4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/uct/cuda/base/cuda_iface.c (2)
src/ucs/debug/memtrack.c (2)
ucs_free(368-372)ucs_malloc(328-334)src/ucs/datastruct/mpool.c (7)
ucs_mpool_get(215-218)ucs_mpool_put(220-223)ucs_mpool_chunk_malloc(326-330)ucs_mpool_chunk_free(332-335)ucs_mpool_params_reset(60-73)ucs_mpool_init(81-147)ucs_mpool_cleanup(149-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: UCX PR (Static_check Static checks)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (7)
src/uct/cuda/base/cuda_iface.c (7)
260-265: LGTM: Clean flush event detection.The function pointer comparison is an effective way to identify flush events. The dual check (non-NULL comp and matching function) ensures correctness.
275-290: LGTM: Correct flush event handling in progress path.The updated logic properly distinguishes flush events from regular events:
- Flush events are extracted immediately (serving as queue markers)
- Regular events are queried for completion before extraction
- Flush events skip
complete_eventinvocation (line 285-287) since they represent flush markers rather than actual CUDA operationsThe flush-as-marker design is sound: when a flush event reaches the front of the queue, all preceding operations are guaranteed complete.
451-457: LGTM: Appropriate mpool configuration for flush descriptors.Unlike regular event descriptors that require CUDA event initialization, flush descriptors are simple structures that use queue ordering for synchronization. NULL init/cleanup callbacks are correct here.
465-486: LGTM: Proper flush mpool initialization.The constructor correctly:
- Initializes mpool parameters with appropriate element size
- Checks initialization status and returns errors properly
- Follows the same pattern as event descriptor mpool initialization elsewhere in the codebase
502-502: LGTM: Proper flush mpool cleanup with leak checking.The destructor correctly cleans up the flush mpool with leak checking enabled, matching the cleanup pattern for other mpools in the codebase.
209-211: No issues found after verification.The NULL completion handling is correct and follows expected UCT semantics. When
comp == NULL, the function returnsUCS_INPROGRESSif there are active streams, indicating work is pending without a callback request—consistent with other async transports like UD. The mechanism is well-documented in the code comments explaining the flush event queue mechanism.
242-257: Rollback logic appears sound — no issues found.The verification confirms the implementation is correct:
ucs_queue_iter_elemmacro correctly uses container_of to retrieve queue descriptors from the iterator over the embeddedqueuefield- Tail element assumption is valid — the rollback retrieves exactly
stream_counterelements (matching successful pushes) from each queue; since push appends to tail and retrieval grabs tail, the most-recently-pushed item is correctly identified for rollback- Type cast is safe —
superis the first field ofuct_cuda_flush_stream_desc_t, so castinguct_cuda_event_desc_t*back to the parent type is validThe code relies on single-threaded enqueue semantics with no concurrent progress drains during the error path, which appears consistent with the endpoint context.
What?
Add support for flush_ep
Why?
Required for cuda_ipc to be used in gtests
Summary by CodeRabbit