UCT/GDA: channel_id implementation #11013

Artemy-Mellanox · 2025-11-18T11:16:18Z

Summary by CodeRabbit

New Features
- Multi-channel GDA support added with a new config option to set number of channels and channel-aware device operations.
Tests
- Extended and updated tests and kernels to validate multi-channel behavior and per-channel routing.
Chores
- Public device interfaces adjusted to accept channel identifiers (impacting callers/headers).

coderabbitai · 2025-11-18T11:16:43Z

Walkthrough

Adds multi-channel support to the MLX5 GDA path: refactors per-ep and device-ep layouts to per-channel QP/CQ blocks, threads channel_id through public device APIs and internal helpers, removes channel_id from the UCP request struct, and updates CUDA test kernels to route per-thread operations to channels.

Changes

Cohort / File(s)	Change Summary
UCT Device API Signature Updates `src/uct/api/device/uct_device_impl.h`	Added `unsigned channel_id` parameter to `uct_device_ep_put_single`, `uct_device_ep_atomic_add`, `uct_device_ep_put_multi`, `uct_device_ep_put_multi_partial` (positioned before flags); updated Doxygen.
UCP Request Structure Updates `src/ucp/api/device/ucp_device_impl.h`	Removed public field `unsigned channel_id` from `ucp_device_request_t`; updated internal macro invocations to pass channel_id as an explicit argument.
MLX5 GDA Interface & Endpoint Refactor `src/uct/ib/mlx5/gdaki/gdaki.h`, `src/uct/ib/mlx5/gdaki/gdaki_dev.h`	Added `uct_rc_gdaki_channel_t` and `num_channels` on iface; replaced ep-level sq_db with `channels` pointer; introduced `uct_rc_gdaki_dev_qp_t` and `uct_rc_gdaki_dev_ep_t` with flexible `qps[0]`; added `channel_id` to completion struct.
MLX5 GDA Core Multi-Channel Implementation `src/uct/ib/mlx5/gdaki/gdaki.c`	Added `num_channels` config; changed dev-ep layout calc and get_device_ep to allocate per-channel CQ/QP and DBREC; updated address serialization, connect/is_connected, iface_query, init, and cleanup to handle multiple channels.
MLX5 GDA CUDA Helpers & API Entrypoints `src/uct/ib/mlx5/gdaki/gdaki.cuh`	Propagated `cid` through many helpers (WQE/CQ/DBR/parse/reserve/prepare/post); changed signatures to accept `unsigned cid`; switched per-ep accesses to `ep->qps[cid]`; populate `channel_id` in completions.
UCP CUDA Test Kernel Changes `test/gtest/ucp/cuda/test_kernels.cu`, `test/gtest/ucp/cuda/test_kernels.h`	Added `unsigned num_channels` to `test_ucp_device_kernel_params_t`; compute per-thread `channel_id` and pass it to `ucp_device_put_single`, `ucp_device_put_multi`, `ucp_device_put_multi_partial`; adjusted MLX5 completion accumulation to iterate channels.
UCP Device Test Mode `test/gtest/ucp/test_ucp_device.cc`	Added `MULTI_CHANNEL` send mode and override `init()` to set `UCX_RC_GDA_NUM_CHANNELS` when used.
UCT CUDA Test Call Sites `test/gtest/uct/cuda/test_kernels.cu`, `test/gtest/uct/cuda/test_kernels_uct.cu`	Updated kernel host launches / device call sites to pass an extra channel_id argument (often `0` in tests) before flags/completion parameters.

Sequence Diagram(s)

sequenceDiagram
    participant Kernel as CUDA Kernel
    participant API as UCP/UCT API
    participant GDA as GDA Core
    participant QP as Per-Channel QP

    Note over Kernel,GDA: Before (single-channel)
    Kernel->>API: put_single(addr,rkey,data)
    API->>GDA: route to ep (no cid)
    GDA->>QP: access ep->qp (global)

    Note over Kernel,GDA: After (multi-channel)
    Kernel->>Kernel: compute channel_id
    Kernel->>API: put_single(addr,rkey,data,channel_id)
    API->>GDA: invoke with channel context
    GDA->>QP: access ep->qps[channel_id]
    QP->>QP: per-channel WQE/CQ/DBR operations

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pay special attention to:
- src/uct/ib/mlx5/gdaki/gdaki.cuh — pervasive signature changes and per-cid indexing (WQE/CQ/DBR/reserve/parse).
- src/uct/ib/mlx5/gdaki/gdaki.c — allocation/cleanup paths, address packing/unpacking across channels, connection logic per-channel.
- src/uct/ib/mlx5/gdaki/gdaki_dev.h — flexible array layout (qps[0]) and structure offsets/alignments.
- Test updates (test/gtest/ucp/cuda/*, test/gtest/uct/cuda/*) — ensure channel_id computation and calling conventions match API changes.

Possibly related PRs

UCT/GDA: Collapsed CQ #10959 — modifies the MLX5 GDA path and device-ep/WQE helpers; strong overlap with multi-channel refactor.

Suggested reviewers

ofirfarjun7

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly reflects the main objective of the PR: implementing channel_id support for the UCT/GDA (User-level Communication Transport / GPU Direct Async) subsystem. The changes consistently add channel_id parameters across multiple device API functions and introduce multi-channel infrastructure throughout the codebase.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
294-338: Fix unsafe use of completion pointer when it may be nullptr

In uct_rc_mlx5_gda_ep_single, uct_rc_mlx5_gda_ep_put_multi, and uct_rc_mlx5_gda_ep_put_multi_partial, the code unconditionally does:

uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;

and later checks if (comp != nullptr).

However, for the public UCP device APIs it is explicitly valid to call with req == nullptr, which leads to comp == nullptr being passed down to these UCT entry points (via ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING). When tl_comp is nullptr, forming &tl_comp->rc_gda is undefined behavior: unless rc_gda sits at offset 0, the result is a small non-null pointer, so the later comp != nullptr check passes and a write through comp can corrupt device memory. This breaks the documented “no-request / no-completion” fast path. Based on learnings.

You already have logic that handles the “no completion object, rely on FC only” case using comp == nullptr. The only missing piece is guarding the initial derivation of comp. Suggested fix:
@@ template<ucs_device_level_t level>
 UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_single(
         uct_rc_gdaki_dev_ep_t *ep, const uct_device_mem_element_t *tl_mem_elem,
         const void *address, uint32_t lkey, uint64_t remote_address,
         uint32_t rkey, size_t length, unsigned cid, uint64_t flags,
         uct_device_completion_t *tl_comp, uint32_t opcode, bool is_atomic,
         uint64_t add)
 {
-    uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;
+    uct_rc_gda_completion_t *comp = nullptr;
+    if (tl_comp != nullptr) {
+        comp = &tl_comp->rc_gda;
+    }
@@ template<ucs_device_level_t level>
 UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_put_multi(
         uct_device_ep_h tl_ep, const uct_device_mem_element_t *tl_mem_list,
@@
-    uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;
+    uct_rc_gda_completion_t *comp = nullptr;
+    if (tl_comp != nullptr) {
+        comp = &tl_comp->rc_gda;
+    }
@@ template<ucs_device_level_t level>
 UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_put_multi_partial(
         uct_device_ep_h tl_ep, const uct_device_mem_element_t *tl_mem_list,
@@
-    uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;
+    uct_rc_gda_completion_t *comp = nullptr;
+    if (tl_comp != nullptr) {
+        comp = &tl_comp->rc_gda;
+    }
The existing if (comp != nullptr) guards in these functions will then work as intended for both “with request” and “no request” cases.

Also applies to: 340-372, 374-463, 465-558
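The guard suggested above can be isolated into a small helper. The following is a minimal host-side sketch, not the UCX code: the struct layouts and the name `get_rc_gda_comp` are hypothetical stand-ins, and `rc_gda` is deliberately placed at a non-zero offset to show why the unguarded form would not yield nullptr.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the UCT completion types named in the comment.
struct rc_gda_completion {
    uint64_t wqe_idx;
    unsigned channel_id;
};

struct device_completion {
    uint32_t          count;  // assumed leading field, so rc_gda is not at offset 0
    rc_gda_completion rc_gda;
};

// Derive the transport completion only when the caller supplied one, so the
// existing "comp != nullptr" checks keep their meaning on the no-request path.
inline rc_gda_completion *get_rc_gda_comp(device_completion *tl_comp)
{
    return (tl_comp != nullptr) ? &tl_comp->rc_gda : nullptr;
}
```

With a helper like this, each of the three entry points would derive comp in one line and could not drift apart.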
test/gtest/ucp/cuda/test_kernels.h (1)

22-66: First init_params factory method fails to initialize num_channels, causing modulo-by-zero in kernel

The struct addition is incompletely integrated. The first init_params() at line 366 in test/gtest/ucp/test_ucp_device.cc uses zero-initialization (params = {}), which leaves num_channels at 0. When the kernel executes channel_id = threadIdx.x % params.num_channels; (lines 23 and 26 in test_kernels.cu), this is a modulo by zero, which is undefined behavior and may crash the kernel.

The second init_params() at line 468 correctly sets num_channels = 1 (then 32 for multi-channel), but the first one does not. This breaks all test methods that call the first init_params().

Fix: Add params.num_channels = 1; after line 371 in the first init_params() method.
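As a defensive complement to that fix, the channel selection itself could tolerate an unset count. This is a sketch only: `select_channel` is a hypothetical helper, written as plain C++ so it can be checked on the host, and treating 0 as "one channel" is an assumption rather than what the test kernels do.

```cpp
#include <cassert>

// Hypothetical helper: map a thread index to a channel. A zero channel count
// (e.g. from a zero-initialized params struct) is treated as a single channel
// instead of reaching the modulo.
inline unsigned select_channel(unsigned thread_idx, unsigned num_channels)
{
    if (num_channels == 0) {
        num_channels = 1;
    }
    return thread_idx % num_channels;
}
```

Setting `num_channels = 1` in `init_params()` remains the simpler fix; the guard only matters if another caller forgets to initialize the field.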

🧹 Nitpick comments (4)

test/gtest/ucp/test_ucp_device.cc (2)
439-457: MULTI_CHANNEL variant wiring and init look sound; consider centralizing channel count constant

The new MULTI_CHANNEL send mode is correctly threaded through get_test_variants(), get_send_mode(), and the init() override; setting UCX_RC_GDA_NUM_CHANNELS before test_ucp_device::init() ensures the transport sees the config when the context/EPs are created.

To avoid future drift, consider defining a single constant for the multi‑channel count (e.g. static const unsigned MULTI_CHANNEL_COUNT = 32;) and using it both for the env var and for params.num_channels in init_params(). This keeps tests consistent if the desired channel count ever changes.

Also applies to: 461-466

468-480: Clarify MULTI_CHANNEL switch behavior; avoid implicit fallthrough ambiguity

In init_params() the MULTI_CHANNEL case sets params.num_channels = 32; and then falls through to NODELAY_WITH_REQ (no break;), so MULTI_CHANNEL currently behaves as “NODELAY_WITH_REQ + multi‑channel”.

If that coupling is intentional, consider making it explicit to avoid ambiguity and potential -Wimplicit-fallthrough warnings:
-    params.num_channels = 1;
+    params.num_channels = 1;
     switch (get_send_mode()) {
-    case MULTI_CHANNEL:
-        params.num_channels = 32;
-    case NODELAY_WITH_REQ:
-        params.with_no_delay = true;
-        params.with_request  = true;
-        break;
+    case MULTI_CHANNEL:
+        params.num_channels = 32;
+        params.with_no_delay = true;
+        params.with_request  = true;
+        break;
+    case NODELAY_WITH_REQ:
+        params.with_no_delay = true;
+        params.with_request  = true;
+        break;
Alternatively, if you prefer relying on fallthrough, adding an explicit /* fallthrough */ (or the project’s fallthrough macro) after params.num_channels = 32; would still document the intent and keep compilers quiet.
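For illustration, the annotated fallthrough variant might look like the sketch below. It uses the standard C++17 `[[fallthrough]]` attribute rather than any project macro. The `kernel_params` struct, `make_params`, and the `OTHER_MODE` enumerator are hypothetical stand-ins for the test fixture.

```cpp
#include <cassert>

enum send_mode { NODELAY_WITH_REQ, MULTI_CHANNEL, OTHER_MODE /* hypothetical */ };

// Hypothetical subset of the test kernel parameters discussed above.
struct kernel_params {
    unsigned num_channels;
    bool     with_no_delay;
    bool     with_request;
};

inline kernel_params make_params(send_mode mode)
{
    kernel_params params{1, false, false};

    switch (mode) {
    case MULTI_CHANNEL:
        params.num_channels = 32;
        [[fallthrough]]; // MULTI_CHANNEL reuses the NODELAY_WITH_REQ settings
    case NODELAY_WITH_REQ:
        params.with_no_delay = true;
        params.with_request  = true;
        break;
    default:
        break;
    }

    return params;
}
```

The attribute documents the intent and silences `-Wimplicit-fallthrough` without duplicating the two assignments.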
src/uct/api/device/uct_device_impl.h (1)

37-71: UCT single/atomic device APIs: channel_id integration looks correct

The new channel_id parameter is added in a consistent position (before flags) and correctly forwarded only to the RC_MLX5_GDA backend; CUDA IPC remains unchanged and simply ignores the channel. This keeps the API uniform without breaking existing CUDA behavior.

You may want to explicitly document that transports other than RC_MLX5_GDA currently ignore channel_id so callers don’t over-interpret it.

Also applies to: 86-117

test/gtest/uct/cuda/test_kernels_uct.cu (1)

100-113: UCT CUDA kernel tests updated consistently for channel_id

All test kernels now pass an explicit channel_id argument (0) in the correct position for uct_device_ep_put_single, uct_device_ep_atomic_add, uct_device_ep_put_multi, and uct_device_ep_put_multi_partial. This keeps the tests aligned with the new API without changing their semantics (still single-channel).

Once multi-channel support is more mature, consider extending these tests to exercise non-zero channel_id values as well.

Also applies to: 169-181, 225-243, 308-325

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b13cf9b and a3e0349.

📒 Files selected for processing (11)

src/ucp/api/device/ucp_device_impl.h (4 hunks)
src/uct/api/device/uct_device_impl.h (8 hunks)
src/uct/ib/mlx5/gdaki/gdaki.c (16 hunks)
src/uct/ib/mlx5/gdaki/gdaki.cuh (21 hunks)
src/uct/ib/mlx5/gdaki/gdaki.h (1 hunks)
src/uct/ib/mlx5/gdaki/gdaki_dev.h (2 hunks)
test/gtest/ucp/cuda/test_kernels.cu (3 hunks)
test/gtest/ucp/cuda/test_kernels.h (1 hunks)
test/gtest/ucp/test_ucp_device.cc (2 hunks)
test/gtest/uct/cuda/test_kernels.cu (4 hunks)
test/gtest/uct/cuda/test_kernels_uct.cu (4 hunks)

🧰 Additional context used

🧠 Learnings (1)

📚 Learning: 2025-11-06T09:04:19.215Z

Learnt from: iyastreb
Repo: openucx/ucx PR: 10906
File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91
Timestamp: 2025-11-06T09:04:19.215Z
Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.

Applied to files:

src/ucp/api/device/ucp_device_impl.h
test/gtest/ucp/cuda/test_kernels.cu

🧬 Code graph analysis (4)

test/gtest/uct/cuda/test_kernels_uct.cu (1)

src/uct/api/device/uct_device_impl.h (2)

uct_device_ep_put_single (54-71)

uct_device_ep_atomic_add (102-117)

test/gtest/ucp/test_ucp_device.cc (1)

test/gtest/ucp/ucp_test.cc (6)

add_variant_values (488-499)

add_variant_values (488-490)

add_variant_values (501-509)

add_variant_values (501-503)

init (97-104)

init (97-97)

test/gtest/ucp/cuda/test_kernels.cu (2)

test/gtest/ucp/test_ucp_device.cc (6)

params (68-68)

params (388-394)

params (389-389)

params (396-409)

params (396-398)

params (468-496)

src/ucp/api/device/ucp_device_impl.h (2)

ucp_device_put_single (142-165)

ucp_device_put_multi (263-289)

src/uct/ib/mlx5/gdaki/gdaki.c (6)

src/ucs/debug/memtrack.c (2)

ucs_calloc (336-342)

ucs_free (368-372)

src/uct/ib/mlx5/dv/ib_mlx5_dv.c (2)

uct_ib_mlx5_devx_create_cq_common (558-643)

uct_ib_mlx5_devx_create_qp_common (127-272)

src/uct/ib/base/ib_verbs.h (2)

uct_ib_pack_uint24 (127-132)

uct_ib_unpack_uint24 (134-137)

src/uct/ib/mlx5/rc/rc_mlx5_devx.c (1)

uct_rc_mlx5_iface_common_devx_connect_qp (384-517)

src/uct/ib/rc/base/rc_iface.c (1)

uct_rc_iface_fill_attr (820-833)

src/uct/ib/mlx5/ib_mlx5.c (1)

uct_ib_mlx5_wq_calc_sizes (342-346)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: UCX PR (Codestyle ctags check)
GitHub Check: UCX PR (Codestyle codespell check)
GitHub Check: UCX PR (Codestyle format code)
GitHub Check: UCX PR (Codestyle AUTHORS file update check)
GitHub Check: UCX PR (Codestyle commit title)
GitHub Check: UCX release DRP (Prepare CheckRelease)
GitHub Check: UCX release (Prepare CheckRelease)
GitHub Check: UCX snapshot (Prepare Check)

🔇 Additional comments (4)

src/ucp/api/device/ucp_device_impl.h (2)

262-289: Multi-element UCP device ops pass channel_id cleanly to UCT

ucp_device_put_multi and ucp_device_put_multi_partial now pass channel_id in front of flags to uct_device_ep_put_multi / _put_multi_partial, matching the new UCT API contract. The existing mem-list handling and req/comp initialization patterns are unchanged and still compatible with req == nullptr.

Also applies to: 345-376

141-165: Channel-aware UCP device single/atomic ops are wired correctly into UCT API—verification complete

The new channel_id parameter is consistently threaded from ucp_device_put_single / ucp_device_counter_inc into uct_device_ep_put_single / uct_device_ep_atomic_add with the expected argument ordering (position 6 for put_single, position 5 for atomic_add), while preserving the existing req == nullptr fast-path behavior via ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING.

All four test call sites (test/gtest/uct/cuda/test_kernels.cu:22, :57 and test/gtest/uct/cuda/test_kernels_uct.cu:110, :178) have been verified to use the correct parameter order and pass valid channel_id values. No functional issues spotted.

src/uct/api/device/uct_device_impl.h (1)

143-188: UCT multi/multi_partial APIs correctly propagate channel_id into MLX5 GDA path

For uct_device_ep_put_multi and uct_device_ep_put_multi_partial, the new channel_id is passed through to the GDA implementations while CUDA IPC continues to use only flags/comp. Argument ordering is consistent with UCP and the tests. No functional issues seen.

Also applies to: 220-269

src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

20-28: Per-channel QP/CQ handling and WQE layout look coherent

The changes to use ep->qps[cid] for SQ/CQ/DBR state (sq_db, sq_num, sq_rsvd_index, sq_ready_index, sq_lock, cq_buff, qp_dbrec) plus the updated uct_rc_mlx5_gda_get_wqe_ptr and CQ parsing code cleanly separate per-channel state:

WQE slot indices are computed as cid * sq_wqe_num + (wqe_idx & (sq_wqe_num - 1)), matching a contiguous layout of per-channel SQs.

CQ parsing and max-allocation logic now operate on per-channel CQs and reserved indices.

Doorbell/DBR updates and debug dumps use the correct per-channel QP and CQ buffers.

Completion checking uses comp->channel_id to select the correct QP for CQE parsing and error reporting.

Within those assumptions (identical sq_wqe_num across channels, valid cid indices), the multi-channel plumbing looks consistent.

Also applies to: 30-56, 98-138, 140-203, 204-241, 260-285, 287-292, 560-575
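The slot computation described above can be checked in isolation. This is a host-side sketch under the stated assumptions (identical, power-of-two `sq_wqe_num` on every channel); `wqe_slot` is a hypothetical name, not the helper in gdaki.cuh.

```cpp
#include <cassert>
#include <cstdint>

// Slot index into a contiguous array of per-channel send queues.
// Assumes sq_wqe_num is a power of two, so the mask wraps wqe_idx within
// the channel's own ring.
inline uint64_t wqe_slot(unsigned cid, uint64_t wqe_idx, uint16_t sq_wqe_num)
{
    return (uint64_t)cid * sq_wqe_num + (wqe_idx & (sq_wqe_num - 1u));
}
```

For example, with 8 WQEs per channel, index 9 on channel 2 wraps to position 1 inside that channel's ring and lands at slot 17 overall.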

coderabbitai · 2025-11-18T11:28:11Z

src/uct/ib/mlx5/gdaki/gdaki.c

+    for (i = 0; i < iface->num_channels; i++) {
+        (void)cuMemHostUnregister(self->channels[i].sq_db);
+        uct_ib_mlx5_devx_destroy_qp_common(&self->channels[i].qp.super);
+        uct_ib_mlx5_devx_destroy_cq_common(&self->channels[i].cq);
+    }


⚠️ Potential issue | 🔴 Critical

Use the host pointer when unregistering doorbells

channel->sq_db holds the device pointer returned by cuMemHostGetDevicePointer(), but cuMemHostUnregister() expects the original host pointer. Passing the device pointer silently fails (CUDA_ERROR_INVALID_VALUE), so the doorbell stays registered and the next cuMemHostRegister() on this address will fail. Please guard on sq_db != NULL and unregister via channel->qp.reg->addr.ptr (same applies to the cleanup paths).

-        (void)cuMemHostUnregister(self->channels[i].sq_db);
+        if (self->channels[i].sq_db != NULL) {
+            (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+        }
@@
-        (void)cuMemHostUnregister(self->channels[i].sq_db);
+        if (self->channels[i].sq_db != NULL) {
+            (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+        }
@@
-        (void)cuMemHostUnregister(self->channels[i].sq_db);
+        if (self->channels[i].sq_db != NULL) {
+            (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+        }

Also applies to: 199-209

src/uct/ib/mlx5/gdaki/gdaki.c

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/uct/ib/mlx5/gdaki/gdaki.h (1)

20-28: Per-channel layout and EP state look consistent

The new num_channels field on uct_rc_gdaki_iface_t and the per-EP uct_rc_gdaki_channel_t *channels pointer align with the multi-channel design in gdaki.c (per-channel CQ/QP arrays, per-channel connection and cleanup). Struct wiring and ownership look correct; lifetime and cleanup are handled in the EP ctor/dtor.

Note that all uses assume num_channels >= 1 (e.g., ep_is_connected dereferences channels[0]); see my comment in gdaki.c suggesting we reject NUM_CHANNELS=0 at iface init time to avoid UB.

Also applies to: 31-35, 37-44

🧹 Nitpick comments (1)

src/uct/ib/mlx5/gdaki/gdaki.c (1)
73-85: Layout helper is correct; tiny readability nit in the comma expression

The new uct_rc_gdaki_calc_dev_ep_layout() nicely centralizes the device-EP layout and is used consistently from:

Line 136–137: EP ctor to size the DevX umem and WQ offsets.

Line 415–416: uct_rc_gdaki_ep_get_device_ep() to recompute dev_ep_size and the header size (qp_attr.umem_offset) for host staging.

The asserts on sizeof(uct_rc_gdaki_dev_ep_t) == 64 and sizeof(uct_rc_gdaki_dev_qp_t) == 128 are a good protection against drift with the CUDA side.

One minor readability nit: Line 81 currently uses a comma expression:
*cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t),
qp_attr->umem_offset = *cq_umem_offset_p +
                       sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;
This is legal C, but non-idiomatic and easy to misread as a typo. Consider splitting into two statements for clarity:
-    *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t),
-    qp_attr->umem_offset = *cq_umem_offset_p +
-                           sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;
+    *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t);
+    qp_attr->umem_offset = *cq_umem_offset_p +
+                           sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;
Functionally it’s the same, but much clearer to future readers.

Also applies to: 126-137, 136-137, 415-416
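The header part of the layout can be worked through with the two asserted sizes. The sketch below mirrors only the two assignments quoted above. The `dev_ep_layout` struct and `calc_header_layout` are hypothetical names, and the constants simply restate the 64 and 128 byte sizes from the asserts.

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t DEV_EP_SIZE = 64;   // asserted sizeof(uct_rc_gdaki_dev_ep_t)
constexpr size_t DEV_QP_SIZE = 128;  // asserted sizeof(uct_rc_gdaki_dev_qp_t)

// Hypothetical result type for the two offsets computed by the helper.
struct dev_ep_layout {
    size_t cq_umem_offset;  // per-channel blocks start right after the header
    size_t qp_umem_offset;  // end of the header plus all per-channel blocks
};

inline dev_ep_layout calc_header_layout(unsigned num_channels)
{
    dev_ep_layout layout;

    layout.cq_umem_offset = DEV_EP_SIZE;
    layout.qp_umem_offset = layout.cq_umem_offset + DEV_QP_SIZE * num_channels;
    return layout;
}
```

With one channel the second offset is 192 bytes; with four channels it is 576 bytes.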

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3e0349 and e8aa646.

📒 Files selected for processing (2)

src/uct/ib/mlx5/gdaki/gdaki.c (16 hunks)
src/uct/ib/mlx5/gdaki/gdaki.h (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

src/uct/ib/mlx5/gdaki/gdaki.c (5)

src/ucs/debug/memtrack.c (2)

ucs_calloc (336-342)

ucs_free (368-372)

src/uct/ib/mlx5/dv/ib_mlx5_dv.c (4)

uct_ib_mlx5_devx_create_cq_common (558-643)

uct_ib_mlx5_devx_create_qp_common (127-272)

uct_ib_mlx5_devx_destroy_cq_common (695-699)

uct_ib_mlx5_devx_destroy_qp_common (410-413)

src/uct/ib/base/ib_verbs.h (2)

uct_ib_pack_uint24 (127-132)

uct_ib_unpack_uint24 (134-137)

src/uct/ib/rc/base/rc_iface.c (1)

uct_rc_iface_fill_attr (820-833)

src/uct/ib/mlx5/ib_mlx5.c (1)

uct_ib_mlx5_wq_calc_sizes (342-346)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)

GitHub Check: UCX PR (Static_check Static checks)
GitHub Check: UCX PR (Codestyle ctags check)
GitHub Check: UCX PR (Codestyle codespell check)
GitHub Check: UCX PR (Codestyle format code)
GitHub Check: UCX PR (Codestyle AUTHORS file update check)
GitHub Check: UCX PR (Codestyle commit title)
GitHub Check: UCX release DRP (Prepare CheckRelease)
GitHub Check: UCX release (Prepare CheckRelease)
GitHub Check: UCX snapshot (Prepare Check)

🔇 Additional comments (4)

src/uct/ib/mlx5/gdaki/gdaki.c (4)

87-214: Per-channel CQ/QP allocation and cleanup paths look solid

The EP constructor/dtor changes for per-channel resources are internally consistent:

Line 157: dbrec.mem_id is set from the DevX umem.

Line 159–163: self->channels is allocated with iface->num_channels entries and properly checked for OOM.

Line 166–193: For each channel,

channel->cq.devx.mem.mem and channel->qp.super.devx.mem.mem share the same DevX umem as self->umem.

dbrec.offset is set to qps[i].cq_dbrec / qps[i].qp_dbrec and passed into the common DevX create helpers.

cq_attr.umem_offset and qp_attr.umem_offset are advanced per-channel so each CQ/QP uses distinct regions in the shared umem.

Error paths:

On CQ creation failure, you jump to err_qp and only destroy previously created QPs/CQs.

On QP creation failure, you destroy the just-created CQ for that channel (label err_cq), then unwind earlier channels in err_qp, and finally free channels, dereg the umem, and free GPU memory.

The EP destructor at lines 216–228 complements this properly:

Iterates over iface->num_channels and, per channel, unregisters BF host memory, then destroys QP and CQ, then frees self->channels and the underlying DevX umem and GPU buffer.

Overall, the lifetime and cleanup of per-channel CQs/QPs and the backing umem look correct and leak-free in both success and failure cases.

Also applies to: 216-228

236-250: Multi-channel EP addressing and connect flow are consistent

The new addressing and connect logic lines up correctly:

Line 240–248 (uct_rc_gdaki_ep_get_address):

Uses ucs_serialize_next() to walk a raw buffer and uct_ib_pack_uint24() to emit one 24-bit qp_num per channel.

Line 359–360 (uct_rc_gdaki_iface_query):

Sets ep_addr_len = sizeof(uct_ib_uint24_t) * iface->num_channels, which matches exactly what ep_get_address packs.

Line 265–295 (uct_rc_gdaki_ep_connect_to_ep_v2):

Mirrors the packing side by repeatedly calling ucs_serialize_next() on ep_addr and uct_ib_unpack_uint24() to recover per-channel destination QP numbers.

Connects each local ep->channels[i].qp.super to its corresponding remote QP via uct_rc_mlx5_iface_common_devx_connect_qp() with the same AH/path MTU used previously.

This yields a clean, symmetric mapping from packed qp_num[0..num_channels-1] to per-channel QPs and keeps the address length consistent with the configuration.

No functional issues spotted here.

Also applies to: 259-297, 359-360
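The symmetry between packing and unpacking can be sketched as a round trip. This is not the UCX implementation: `pack_u24` and `unpack_u24` are hypothetical stand-ins for the 24-bit helpers, their byte order is an assumption, and `std::vector` replaces the raw buffer walk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 24-bit helpers; the byte order chosen here is an assumption.
inline void pack_u24(uint8_t *dst, uint32_t value)
{
    dst[0] = static_cast<uint8_t>(value & 0xff);
    dst[1] = static_cast<uint8_t>((value >> 8) & 0xff);
    dst[2] = static_cast<uint8_t>((value >> 16) & 0xff);
}

inline uint32_t unpack_u24(const uint8_t *src)
{
    return uint32_t(src[0]) | (uint32_t(src[1]) << 8) | (uint32_t(src[2]) << 16);
}

// Emit one 24-bit QP number per channel, in channel order.
inline std::vector<uint8_t> pack_ep_address(const std::vector<uint32_t> &qp_nums)
{
    std::vector<uint8_t> addr(qp_nums.size() * 3);
    for (size_t i = 0; i < qp_nums.size(); i++) {
        pack_u24(&addr[i * 3], qp_nums[i]);
    }
    return addr;
}

// Recover the per-channel QP numbers; the address length implies the count.
inline std::vector<uint32_t> unpack_ep_address(const std::vector<uint8_t> &addr)
{
    std::vector<uint32_t> qp_nums(addr.size() / 3);
    for (size_t i = 0; i < qp_nums.size(); i++) {
        qp_nums[i] = unpack_u24(&addr[i * 3]);
    }
    return qp_nums;
}
```

Three channels produce a 9-byte address, which matches the `sizeof(uct_ib_uint24_t) * num_channels` length reported by iface_query.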

392-479: Device-EP creation path matches the new layout and fixes the prior OOM-status bug

The multi-channel uct_rc_gdaki_ep_get_device_ep() changes look correct and address the earlier review concern:

Lines 412–416: Rebuild qp_attr and call uct_rc_gdaki_calc_dev_ep_layout() with iface->num_channels to get consistent cq_umem_offset, dev_ep_size, and qp_attr.umem_offset (header size).

Lines 418–422: Allocate a host staging buffer of size qp_attr.umem_offset (dev_ep), and now correctly set status = UCS_ERR_NO_MEMORY before goto out_ctx when ucs_calloc() fails, so the caller does not see a spurious UCS_OK.

Lines 424–429: Zero the whole GPU-side dev_ep region via cuMemsetD8(ep->ep_gpu, dev_ep_size) before populating the header.

Lines 430–436: Fill common fields (atomic buffer, lkey, WQE count, FC mask, and sq_wqe_daddr pointing into GPU memory at qp_attr.umem_offset).

Lines 437–456: For each channel, register the BF region, obtain the device pointer for the doorbell, and program dev_ep->qps[i].sq_db, sq_num, and clear cq_buff.

Lines 458–463: Copy only the header (qp_attr.umem_offset bytes) from host dev_ep to device (ep->ep_gpu), leaving the WQ region as zeroed.

Lines 474–476: Error label out_free frees dev_ep in all failure paths after allocation, and the context is popped at out_ctx.

The flow is aligned with the layout helper and per-channel host-side setup in the EP ctor, and the important OOM / error-status semantics are now correct.

I don’t see further correctness issues in this path.

15-16: Includes and config wiring for multi-channel support are coherent

A few smaller but important wiring details look good:

Lines 15–16: Adding ucs/type/serialize.h and uct/ib/base/ib_verbs.h is appropriate for ucs_serialize_next() and the 24-bit pack/unpack helpers.

Lines 23–27 and 38–42: The new num_channels field in uct_rc_gdaki_iface_config_t and "NUM_CHANNELS" config table entry are consistent, with a sensible default of "1".

Lines 380–390: uct_rc_gdaki_create_cq() continues to disable regular CQs (type UCT_IB_MLX5_OBJ_TYPE_NULL), which is compatible with the per-channel DevX CQs created from the EP ctor.

Lines 634–635: self->num_channels = config->num_channels cleanly propagates the config into the iface instance and is used consistently across the file.

Apart from the need to reject NUM_CHANNELS=0 called out in my other comment, this plumbing looks correct.

Also applies to: 23-27, 38-42, 380-390, 634-635

src/uct/ib/mlx5/gdaki/gdaki.c

ofirfarjun7 · 2025-11-20T11:27:40Z

src/uct/ib/mlx5/gdaki/gdaki.c

+    ucs_assert(sizeof(uct_rc_gdaki_dev_ep_t) == 64);
+    ucs_assert(sizeof(uct_rc_gdaki_dev_qp_t) == 128);
+
+    *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t),


ofirfarjun7 · 2025-11-24T16:13:31Z

src/uct/ib/mlx5/gdaki/gdaki.cuh

-    uint32_t *data_ptr = (uint32_t*)&cqe64->wqe_counter;
-    uint32_t data      = READ_ONCE(*data_ptr);
-    uint64_t rsvd_idx  = READ_ONCE(ep->sq_rsvd_index);
+    uct_rc_gdaki_dev_qp_t *qp = ep->qps + cid;


Minor: qps[cid]
Maybe add channel bounds check assertion.
*can use helper func.
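A helper along the lines suggested here might look like the following sketch. It assumes the device EP carries its channel count, which the quoted diff does not show; `dev_ep`, `dev_qp`, and `get_qp` are hypothetical names, and plain `assert` stands in for whatever device-side assertion the project uses.

```cpp
#include <cassert>

// Hypothetical, reduced versions of the device-side structures.
struct dev_qp {
    unsigned sq_num;
};

struct dev_ep {
    unsigned num_channels;  // assumed field; not shown in the quoted diff
    dev_qp  *qps;
};

// Single access point for per-channel state, with a bounds check.
inline dev_qp *get_qp(dev_ep *ep, unsigned cid)
{
    assert(cid < ep->num_channels);
    return &ep->qps[cid];
}
```

Routing every `ep->qps + cid` through one helper keeps the indexing style uniform and gives the bounds check a single home.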

ovidiusm · 2025-11-26T10:37:49Z

src/uct/ib/mlx5/gdaki/gdaki_dev.h

-    uint32_t                     cqe_num;
-    uint16_t                     sq_wqe_num;
    uint32_t                     sq_num;
+    uint8_t                      pad[12];


Is this padding correct? I computed total size 124 bytes. Or is there internal padding around the lock?

maybe move sq_num after sq_lock, to avoid a "hole"

yosefe · 2025-11-27T14:21:31Z

test/gtest/ucp/test_ucp_device.cc

+        params.num_channels = 1;
        switch (get_send_mode()) {
+        case MULTI_CHANNEL:
+            params.num_channels = 32;


need break;

it was intentional fall-through, rest of params come from NODELAY_WITH_REQ

so pls add comment

yosefe · 2025-11-27T14:22:08Z

test/gtest/uct/cuda/test_kernels.cu


    ucs_status_t status = uct_device_ep_put_single<UCS_DEVICE_LEVEL_THREAD>(
-            ep, mem_elem, va, rva, length, UCT_DEVICE_FLAG_NODELAY, &comp);
+            ep, mem_elem, va, rva, length, 0, UCT_DEVICE_FLAG_NODELAY, &comp);


maybe use channel_id also in uct tests?

yosefe · 2025-11-27T14:24:29Z

src/uct/ib/mlx5/gdaki/gdaki.c

+    ucs_assert(sizeof(uct_rc_gdaki_dev_ep_t) == 64);
+    ucs_assert(sizeof(uct_rc_gdaki_dev_qp_t) == 128);


UCS_STATIC_ASSERT

yosefe · 2025-11-27T14:25:16Z

src/uct/ib/mlx5/gdaki/gdaki.c

-        goto err_cq;
-    }
+    for (i = 0; i < iface->num_channels; i++) {
+        channel = self->channels + i;


minor: &self->channels[i];

yosefe · 2025-11-27T14:26:10Z

src/uct/ib/mlx5/gdaki/gdaki.c

+    while (i > 0) {
+        i--;


minor: while (i-- > 0)
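The suggested loop form unwinds exactly the channels created before the failure, in reverse order. A minimal sketch, assuming a hypothetical `create_channels` that fails at a chosen index and records which channels it tears down:

```cpp
#include <cassert>
#include <vector>

// Hypothetical: "create" n channels, failing at index fail_at. On failure,
// channels [0, i) are destroyed in reverse order; channel i itself was never
// created, so it is skipped.
inline bool create_channels(unsigned n, unsigned fail_at,
                            std::vector<unsigned> &destroyed)
{
    unsigned i;

    for (i = 0; i < n; i++) {
        if (i == fail_at) {
            goto err;
        }
    }
    return true;

err:
    while (i-- > 0) {
        destroyed.push_back(i);  // stands in for destroy QP / destroy CQ
    }
    return false;
}
```

Because the decrement happens in the condition, the loop is also safe when the very first channel fails (i == 0).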

yosefe · 2025-11-27T14:29:18Z

src/uct/ib/mlx5/gdaki/gdaki.c

+    unsigned i;
+
+    for (i = 0; i < iface->num_channels; i++) {
+        (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);


do we need to check flag that it was registered/initialized?

the page with the UAR might or might not be registered already, so currently we just ignore errors. this may cause use-after-free if we release a page which is used by another EP. need some tracking. WDYT?

can we check dev_ep_init flag?
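One possible shape for the tracking discussed in this thread is a reference count per registered page, so that only the last user unregisters it. This is a sketch of the idea only: `page_registry` is hypothetical, it is not thread-safe, and the actual `cuMemHostRegister` / `cuMemHostUnregister` calls are left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical registry counting how many EPs use each registered page.
class page_registry {
public:
    // Returns true when the caller should perform the actual registration.
    bool acquire(uintptr_t page)
    {
        return ++refs_[page] == 1;
    }

    // Returns true when the caller should perform the actual unregistration.
    bool release(uintptr_t page)
    {
        auto it = refs_.find(page);
        if (it == refs_.end()) {
            return false;  // never registered through this registry
        }
        if (--it->second > 0) {
            return false;  // still used by another EP
        }
        refs_.erase(it);
        return true;
    }

private:
    std::map<uintptr_t, unsigned> refs_;
};
```

A per-EP flag such as dev_ep_init answers "did this EP register?", while a shared count also answers "is anyone else still using the page?".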

yosefe · 2025-11-27T14:38:33Z

src/uct/ib/mlx5/gdaki/gdaki.cuh

+    pi = uct_rc_mlx5_gda_parse_cqe(ep, cid, &wqe_cnt, &opcode);

-    if (pi < comp->wqe_idx) {
+    if ((int64_t)pi < (int64_t)comp->wqe_idx) {


so we expect 64bit wraparound? why need to cast?

for first message wqe_idx will be 0 and initial pi is -1

ok, maybe worth adding comment
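The effect of the cast can be shown with the values mentioned in the reply. In this sketch `is_pending` is a hypothetical name for the comparison, and the initial producer index is assumed to be all ones (that is, -1 when read as signed), as stated above.

```cpp
#include <cassert>
#include <cstdint>

// True while the completed index is still behind the request's WQE index.
// Comparing as signed makes the initial pi of -1 sort below wqe_idx 0.
inline bool is_pending(uint64_t pi, uint64_t wqe_idx)
{
    return (int64_t)pi < (int64_t)wqe_idx;
}
```

With an unsigned comparison, `UINT64_MAX < 0` is false, so the first request (wqe_idx 0) would look complete before any CQE arrived.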

yosefe · 2025-11-27T14:40:22Z

src/uct/ib/mlx5/gdaki/gdaki_dev.h


 typedef struct {
    uint64_t wqe_idx;
+    unsigned channel_id;


maybe we add channel_id to ucp_device_progress_req (next pr because it can break api)?

ofirfarjun7 · 2025-11-28T20:54:29Z

src/uct/ib/mlx5/gdaki/gdaki.c

+        for (i = 0; i < iface->num_channels; i++) {
+            channel = ep->channels + i;
+
+            (void)cuMemHostRegister(channel->qp.reg->addr.ptr,


Don't we need to unregister in case of an error?

ofirfarjun7 · 2025-11-28T20:59:00Z

src/uct/ib/mlx5/gdaki/gdaki.c

+    unsigned i;

-    uct_ib_pack_uint24(rc_addr->qp_num, ep->qp.super.qp_num);
+    for (i = 0; i < iface->num_channels; i++) {


So we assume all peers use same #channels?
Maybe we should add #channels to the address to validate it is equal?
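One way to make the mismatch detectable is a count prefix in the address. The sketch below assumes a hypothetical format (one byte holding the channel count, followed by three bytes per QP number); this is not the format the PR serializes, and `peer_channels_match` is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical address format: [count][qp0: 3 bytes][qp1: 3 bytes]...
// Returns true only if the peer advertises the same channel count and the
// address length is consistent with it.
inline bool peer_channels_match(const std::vector<uint8_t> &addr,
                                unsigned local_num_channels)
{
    return !addr.empty() &&
           (addr[0] == local_num_channels) &&
           (addr.size() == 1 + 3u * local_num_channels);
}
```

A connect path could then return an error on mismatch instead of reading past the end of a shorter peer address.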

ofirfarjun7 · 2025-11-28T21:05:56Z

src/uct/ib/mlx5/gdaki/gdaki.c

+                &iface->super, &ep->channels[i].qp.super, dest_qp_num, &ah_attr,
+                path_mtu, path_index, iface->super.super.config.max_rd_atomic);
+        if (status != UCS_OK) {
+            return status;


The QPs connected before the error remain connected; is that a problem?

ofirfarjun7 · 2025-11-28T21:08:29Z

src/uct/ib/mlx5/gdaki/gdaki.c

     ucs_offsetof(uct_rc_gdaki_iface_config_t, mlx5),
     UCS_CONFIG_TYPE_TABLE(uct_rc_mlx5_common_config_table)},

+    {"NUM_CHANNELS", "1",


Maybe we need to limit the max value?

ofirfarjun7 · 2025-11-28T21:19:20Z

src/uct/ib/mlx5/gdaki/gdaki_dev.h

-    uint32_t                     sq_num;
    uint16_t                     sq_fc_mask;
+
+    uint8_t                      pad[24];


Can we use __attribute__((aligned(X))) or alignas instead of manual padding?
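The difference between the two approaches can be sketched in C11 as follows. The structs are hypothetical and use a shortened field list (`sq_wqe_pi` is an invented member); the target size of 128 bytes is assumed here only to match the size check on `uct_rc_gdaki_dev_qp_t` quoted elsewhere in this review.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Manual padding: the pad length is computed by hand (128 - 8 - 2 = 118)
 * and has to be updated whenever a field is added or removed. */
typedef struct {
    uint64_t sq_wqe_pi;
    uint16_t sq_fc_mask;
    uint8_t  pad[118];
} qp_manual_pad_t;

/* alignas on the first member raises the struct alignment to 128, and the
 * compiler rounds sizeof up to a multiple of that alignment. */
typedef struct {
    alignas(128) uint64_t sq_wqe_pi;
    uint16_t              sq_fc_mask;
} qp_alignas_t;

/* Compile-time checks instead of runtime assertions. */
static_assert(sizeof(qp_manual_pad_t) == 128, "manual pad size mismatch");
static_assert(sizeof(qp_alignas_t) == 128, "alignas size mismatch");
```

The two are not strictly equivalent: manual padding changes only the size, while `alignas` also raises the alignment requirement, so any buffer holding such structs must itself be 128-byte aligned.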

UCT/GDA: channel_id implementation

a3e0349

coderabbitai bot reviewed Nov 18, 2025

UCT/GDA: channel_id implementation - 2

e8aa646

coderabbitai bot reviewed Nov 18, 2025

ofirfarjun7 reviewed Nov 20, 2025

ofirfarjun7 requested a review from michal-shalev November 24, 2025 10:50

ofirfarjun7 reviewed Nov 24, 2025

ovidiusm reviewed Nov 26, 2025

Artemy-Mellanox added 2 commits November 27, 2025 03:31

UCT/GDA: channel_id implementation - 3

390d626

Merge remote-tracking branch 'origin/master' into topic/gda_channels

eab5668

Artemy-Mellanox force-pushed the topic/gda_channels branch from 75c5a3c to eab5668 Compare November 27, 2025 01:42

yosefe reviewed Nov 27, 2025

ofirfarjun7 reviewed Nov 28, 2025

		ucs_assert(sizeof(uct_rc_gdaki_dev_ep_t) == 64);
		ucs_assert(sizeof(uct_rc_gdaki_dev_qp_t) == 128);
