Contributor

@Artemy-Mellanox Artemy-Mellanox commented Nov 18, 2025

Summary by CodeRabbit

  • New Features
    • Multi-channel GDA support added with a new config option to set number of channels and channel-aware device operations.
  • Tests
    • Extended and updated tests and kernels to validate multi-channel behavior and per-channel routing.
  • Chores
    • Public device interfaces adjusted to accept channel identifiers (impacting callers/headers).

coderabbitai bot commented Nov 18, 2025

Walkthrough

Adds multi-channel support to the MLX5 GDA path: refactors per-ep and device-ep layouts to per-channel QP/CQ blocks, threads channel_id through public device APIs and internal helpers, removes channel_id from the UCP request struct, and updates CUDA test kernels to route per-thread operations to channels.

Changes

Cohort / File(s): Change Summary

  • UCT Device API Signature Updates (src/uct/api/device/uct_device_impl.h): Added unsigned channel_id parameter to uct_device_ep_put_single, uct_device_ep_atomic_add, uct_device_ep_put_multi, uct_device_ep_put_multi_partial (positioned before flags); updated Doxygen.
  • UCP Request Structure Updates (src/ucp/api/device/ucp_device_impl.h): Removed public field unsigned channel_id from ucp_device_request_t; updated internal macro invocations to pass channel_id as an explicit argument.
  • MLX5 GDA Interface & Endpoint Refactor (src/uct/ib/mlx5/gdaki/gdaki.h, src/uct/ib/mlx5/gdaki/gdaki_dev.h): Added uct_rc_gdaki_channel_t and num_channels on iface; replaced ep-level sq_db with channels pointer; introduced uct_rc_gdaki_dev_qp_t and uct_rc_gdaki_dev_ep_t with flexible qps[0]; added channel_id to completion struct.
  • MLX5 GDA Core Multi-Channel Implementation (src/uct/ib/mlx5/gdaki/gdaki.c): Added num_channels config; changed dev-ep layout calc and get_device_ep to allocate per-channel CQ/QP and DBREC; updated address serialization, connect/is_connected, iface_query, init, and cleanup to handle multiple channels.
  • MLX5 GDA CUDA Helpers & API Entrypoints (src/uct/ib/mlx5/gdaki/gdaki.cuh): Propagated cid through many helpers (WQE/CQ/DBR/parse/reserve/prepare/post); changed signatures to accept unsigned cid; switched per-ep accesses to ep->qps[cid]; populate channel_id in completions.
  • UCP CUDA Test Kernel Changes (test/gtest/ucp/cuda/test_kernels.cu, test/gtest/ucp/cuda/test_kernels.h): Added unsigned num_channels to test_ucp_device_kernel_params_t; compute per-thread channel_id and pass it to ucp_device_put_single, ucp_device_put_multi, ucp_device_put_multi_partial; adjusted MLX5 completion accumulation to iterate channels.
  • UCP Device Test Mode (test/gtest/ucp/test_ucp_device.cc): Added MULTI_CHANNEL send mode and override init() to set UCX_RC_GDA_NUM_CHANNELS when used.
  • UCT CUDA Test Call Sites (test/gtest/uct/cuda/test_kernels.cu, test/gtest/uct/cuda/test_kernels_uct.cu): Updated kernel host launches / device call sites to pass an extra channel_id argument (often 0 in tests) before flags/completion parameters.

Sequence Diagram(s)

sequenceDiagram
    participant Kernel as CUDA Kernel
    participant API as UCP/UCT API
    participant GDA as GDA Core
    participant QP as Per-Channel QP

    Note over Kernel,GDA: Before (single-channel)
    Kernel->>API: put_single(addr,rkey,data)
    API->>GDA: route to ep (no cid)
    GDA->>QP: access ep->qp (global)

    Note over Kernel,GDA: After (multi-channel)
    Kernel->>Kernel: compute channel_id
    Kernel->>API: put_single(addr,rkey,data,channel_id)
    API->>GDA: invoke with channel context
    GDA->>QP: access ep->qps[channel_id]
    QP->>QP: per-channel WQE/CQ/DBR operations

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Pay special attention to:
    • src/uct/ib/mlx5/gdaki/gdaki.cuh — pervasive signature changes and per-cid indexing (WQE/CQ/DBR/reserve/parse).
    • src/uct/ib/mlx5/gdaki/gdaki.c — allocation/cleanup paths, address packing/unpacking across channels, connection logic per-channel.
    • src/uct/ib/mlx5/gdaki/gdaki_dev.h — flexible array layout (qps[0]) and structure offsets/alignments.
    • Test updates (test/gtest/ucp/cuda/*, test/gtest/uct/cuda/*) — ensure channel_id computation and calling conventions match API changes.

Suggested reviewers

  • ofirfarjun7

Poem

🐇 In rows of queues I hop and play,

Channels multiply the traffic's way,
Per-CID hops, each WQE a beat,
Many little paths make transfer sweet,
Hooray — more lanes to bound and sway!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title directly reflects the main objective of the PR: implementing channel_id support for the UCT/GDA (Unified Communication Transport / GPUDirect Async) subsystem. The changes consistently add channel_id parameters across multiple device API functions and introduce multi-channel infrastructure throughout the codebase.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

294-338: Fix unsafe use of completion pointer when it may be nullptr

In uct_rc_mlx5_gda_ep_single, uct_rc_mlx5_gda_ep_put_multi, and uct_rc_mlx5_gda_ep_put_multi_partial, the code unconditionally does:

  • uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;

and later checks if (comp != nullptr).

However, the public UCP device APIs explicitly allow req == nullptr, in which case comp == nullptr is passed down to these UCT entry points (via ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING). When tl_comp is nullptr, evaluating &tl_comp->rc_gda is undefined behavior; if rc_gda sits at a nonzero offset, the result is a bogus non-null pointer that defeats the later comp != nullptr checks and can corrupt device memory. This breaks the documented “no-request / no-completion” fast path. Based on learnings.

You already have logic that handles the “no completion object, rely on FC only” case using comp == nullptr. The only missing piece is guarding the initial derivation of comp. Suggested fix:

@@ template<ucs_device_level_t level>
 UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_single(
         uct_rc_gdaki_dev_ep_t *ep, const uct_device_mem_element_t *tl_mem_elem,
         const void *address, uint32_t lkey, uint64_t remote_address,
         uint32_t rkey, size_t length, unsigned cid, uint64_t flags,
         uct_device_completion_t *tl_comp, uint32_t opcode, bool is_atomic,
         uint64_t add)
 {
-    uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;
+    uct_rc_gda_completion_t *comp = nullptr;
+    if (tl_comp != nullptr) {
+        comp = &tl_comp->rc_gda;
+    }
@@ template<ucs_device_level_t level>
 UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_put_multi(
         uct_device_ep_h tl_ep, const uct_device_mem_element_t *tl_mem_list,
@@
-    uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;
+    uct_rc_gda_completion_t *comp = nullptr;
+    if (tl_comp != nullptr) {
+        comp = &tl_comp->rc_gda;
+    }
@@ template<ucs_device_level_t level>
 UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_put_multi_partial(
         uct_device_ep_h tl_ep, const uct_device_mem_element_t *tl_mem_list,
@@
-    uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;
+    uct_rc_gda_completion_t *comp = nullptr;
+    if (tl_comp != nullptr) {
+        comp = &tl_comp->rc_gda;
+    }

The existing if (comp != nullptr) guards in these functions will then work as intended for both “with request” and “no request” cases.

Also applies to: 340-372, 374-463, 465-558

test/gtest/ucp/cuda/test_kernels.h (1)

22-66: First init_params factory method fails to initialize num_channels, causing modulo-by-zero in kernel

The struct addition is incompletely integrated. The first init_params() at line 366 in test/gtest/ucp/test_ucp_device.cc uses zero-initialization (params = {}), which leaves num_channels at 0. When the kernel code executes channel_id = threadIdx.x % params.num_channels; (lines 23 and 26 in test_kernels.cu), the result is a modulo by zero, which is undefined behavior and may crash.

The second init_params() at line 468 correctly sets num_channels = 1 (then 32 for multi-channel), but the first one does not. This breaks all test methods that call the first init_params().

Fix: Add params.num_channels = 1; after line 371 in the first init_params() method.
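
The hazard and the suggested fix can be reproduced on the host. A minimal sketch, assuming a simplified stand-in for test_ucp_device_kernel_params_t; kernel_params_t, kernel_params_default, and select_channel are hypothetical names, not UCX code:

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for test_ucp_device_kernel_params_t. */
typedef struct {
    unsigned num_channels;
} kernel_params_t;

/* Zero-initialize, then set the single-channel default so that no caller
 * is left with num_channels == 0. */
static kernel_params_t kernel_params_default(void)
{
    kernel_params_t params = {0};

    params.num_channels = 1;
    return params;
}

/* Host-side mirror of: channel_id = threadIdx.x % params.num_channels */
static unsigned select_channel(unsigned thread_idx, unsigned num_channels)
{
    assert(num_channels > 0); /* x % 0 is undefined behavior */
    return thread_idx % num_channels;
}
```

With the default in place, every thread maps to channel 0 unless a test raises num_channels explicitly.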

🧹 Nitpick comments (4)
test/gtest/ucp/test_ucp_device.cc (2)

439-457: MULTI_CHANNEL variant wiring and init look sound; consider centralizing channel count constant

The new MULTI_CHANNEL send mode is correctly threaded through get_test_variants(), get_send_mode(), and the init() override; setting UCX_RC_GDA_NUM_CHANNELS before test_ucp_device::init() ensures the transport sees the config when the context/EPs are created.

To avoid future drift, consider defining a single constant for the multi‑channel count (e.g. static const unsigned MULTI_CHANNEL_COUNT = 32;) and using it both for the env var and for params.num_channels in init_params(). This keeps tests consistent if the desired channel count ever changes.

Also applies to: 461-466


468-480: Clarify MULTI_CHANNEL switch behavior; avoid implicit fallthrough ambiguity

In init_params() the MULTI_CHANNEL case sets params.num_channels = 32; and then falls through to NODELAY_WITH_REQ (no break;), so MULTI_CHANNEL currently behaves as “NODELAY_WITH_REQ + multi‑channel”.

If that coupling is intentional, consider making it explicit to avoid ambiguity and potential -Wimplicit-fallthrough warnings:

-    params.num_channels = 1;
+    params.num_channels = 1;
     switch (get_send_mode()) {
-    case MULTI_CHANNEL:
-        params.num_channels = 32;
-    case NODELAY_WITH_REQ:
-        params.with_no_delay = true;
-        params.with_request  = true;
-        break;
+    case MULTI_CHANNEL:
+        params.num_channels = 32;
+        params.with_no_delay = true;
+        params.with_request  = true;
+        break;
+    case NODELAY_WITH_REQ:
+        params.with_no_delay = true;
+        params.with_request  = true;
+        break;

Alternatively, if you prefer relying on fallthrough, adding an explicit /* fallthrough */ (or the project’s fallthrough macro) after params.num_channels = 32; would still document the intent and keep compilers quiet.

src/uct/api/device/uct_device_impl.h (1)

37-71: UCT single/atomic device APIs: channel_id integration looks correct

The new channel_id parameter is added in a consistent position (before flags) and correctly forwarded only to the RC_MLX5_GDA backend; CUDA IPC remains unchanged and simply ignores the channel. This keeps the API uniform without breaking existing CUDA behavior.

You may want to explicitly document that transports other than RC_MLX5_GDA currently ignore channel_id so callers don’t over-interpret it.

Also applies to: 86-117

test/gtest/uct/cuda/test_kernels_uct.cu (1)

100-113: UCT CUDA kernel tests updated consistently for channel_id

All test kernels now pass an explicit channel_id argument (0) in the correct position for uct_device_ep_put_single, uct_device_ep_atomic_add, uct_device_ep_put_multi, and uct_device_ep_put_multi_partial. This keeps the tests aligned with the new API without changing their semantics (still single-channel).

Once multi-channel support is more mature, consider extending these tests to exercise non-zero channel_id values as well.

Also applies to: 169-181, 225-243, 308-325

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b13cf9b and a3e0349.

📒 Files selected for processing (11)
  • src/ucp/api/device/ucp_device_impl.h (4 hunks)
  • src/uct/api/device/uct_device_impl.h (8 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.c (16 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.cuh (21 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.h (1 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki_dev.h (2 hunks)
  • test/gtest/ucp/cuda/test_kernels.cu (3 hunks)
  • test/gtest/ucp/cuda/test_kernels.h (1 hunks)
  • test/gtest/ucp/test_ucp_device.cc (2 hunks)
  • test/gtest/uct/cuda/test_kernels.cu (4 hunks)
  • test/gtest/uct/cuda/test_kernels_uct.cu (4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T09:04:19.215Z
Learnt from: iyastreb
Repo: openucx/ucx PR: 10906
File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91
Timestamp: 2025-11-06T09:04:19.215Z
Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.

Applied to files:

  • src/ucp/api/device/ucp_device_impl.h
  • test/gtest/ucp/cuda/test_kernels.cu
🧬 Code graph analysis (4)
test/gtest/uct/cuda/test_kernels_uct.cu (1)
src/uct/api/device/uct_device_impl.h (2)
  • uct_device_ep_put_single (54-71)
  • uct_device_ep_atomic_add (102-117)
test/gtest/ucp/test_ucp_device.cc (1)
test/gtest/ucp/ucp_test.cc (6)
  • add_variant_values (488-499)
  • add_variant_values (488-490)
  • add_variant_values (501-509)
  • add_variant_values (501-503)
  • init (97-104)
  • init (97-97)
test/gtest/ucp/cuda/test_kernels.cu (2)
test/gtest/ucp/test_ucp_device.cc (6)
  • params (68-68)
  • params (388-394)
  • params (389-389)
  • params (396-409)
  • params (396-398)
  • params (468-496)
src/ucp/api/device/ucp_device_impl.h (2)
  • ucp_device_put_single (142-165)
  • ucp_device_put_multi (263-289)
src/uct/ib/mlx5/gdaki/gdaki.c (6)
src/ucs/debug/memtrack.c (2)
  • ucs_calloc (336-342)
  • ucs_free (368-372)
src/uct/ib/mlx5/dv/ib_mlx5_dv.c (2)
  • uct_ib_mlx5_devx_create_cq_common (558-643)
  • uct_ib_mlx5_devx_create_qp_common (127-272)
src/uct/ib/base/ib_verbs.h (2)
  • uct_ib_pack_uint24 (127-132)
  • uct_ib_unpack_uint24 (134-137)
src/uct/ib/mlx5/rc/rc_mlx5_devx.c (1)
  • uct_rc_mlx5_iface_common_devx_connect_qp (384-517)
src/uct/ib/rc/base/rc_iface.c (1)
  • uct_rc_iface_fill_attr (820-833)
src/uct/ib/mlx5/ib_mlx5.c (1)
  • uct_ib_mlx5_wq_calc_sizes (342-346)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (4)
src/ucp/api/device/ucp_device_impl.h (2)

262-289: Multi-element UCP device ops pass channel_id cleanly to UCT

ucp_device_put_multi and ucp_device_put_multi_partial now pass channel_id in front of flags to uct_device_ep_put_multi / _put_multi_partial, matching the new UCT API contract. The existing mem-list handling and req/comp initialization patterns are unchanged and still compatible with req == nullptr.

Also applies to: 345-376


141-165: Channel-aware UCP device single/atomic ops are wired correctly into UCT API—verification complete

The new channel_id parameter is consistently threaded from ucp_device_put_single / ucp_device_counter_inc into uct_device_ep_put_single / uct_device_ep_atomic_add with the expected argument ordering (position 6 for put_single, position 5 for atomic_add), while preserving the existing req == nullptr fast-path behavior via ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING.

All four test call sites (test/gtest/uct/cuda/test_kernels.cu:22, :57 and test/gtest/uct/cuda/test_kernels_uct.cu:110, :178) have been verified to use the correct parameter order and pass valid channel_id values. No functional issues spotted.

src/uct/api/device/uct_device_impl.h (1)

143-188: UCT multi/multi_partial APIs correctly propagate channel_id into MLX5 GDA path

For uct_device_ep_put_multi and uct_device_ep_put_multi_partial, the new channel_id is passed through to the GDA implementations while CUDA IPC continues to use only flags/comp. Argument ordering is consistent with UCP and the tests. No functional issues seen.

Also applies to: 220-269

src/uct/ib/mlx5/gdaki/gdaki.cuh (1)

20-28: Per-channel QP/CQ handling and WQE layout look coherent

The changes to use ep->qps[cid] for SQ/CQ/DBR state (sq_db, sq_num, sq_rsvd_index, sq_ready_index, sq_lock, cq_buff, qp_dbrec) plus the updated uct_rc_mlx5_gda_get_wqe_ptr and CQ parsing code cleanly separate per-channel state:

  • WQE addresses are computed as cid * sq_wqe_num + (wqe_idx & (sq_wqe_num - 1)), matching a contiguous layout of per-channel SQs.
  • CQ parsing and max-allocation logic now operate on per-channel CQs and reserved indices.
  • Doorbell/DBR updates and debug dumps use the correct per-channel QP and CQ buffers.
  • Completion checking uses comp->channel_id to select the correct QP for CQE parsing and error reporting.

Within those assumptions (identical sq_wqe_num across channels, valid cid indices), the multi-channel plumbing looks consistent.

Also applies to: 30-56, 98-138, 140-203, 204-241, 260-285, 287-292, 560-575

Comment on lines 226 to 230
for (i = 0; i < iface->num_channels; i++) {
(void)cuMemHostUnregister(self->channels[i].sq_db);
uct_ib_mlx5_devx_destroy_qp_common(&self->channels[i].qp.super);
uct_ib_mlx5_devx_destroy_cq_common(&self->channels[i].cq);
}

⚠️ Potential issue | 🔴 Critical

Use the host pointer when unregistering doorbells

channel->sq_db holds the device pointer returned by cuMemHostGetDevicePointer(), but cuMemHostUnregister() expects the original host pointer. Passing the device pointer silently fails (CUDA_ERROR_INVALID_VALUE), so the doorbell stays registered and the next cuMemHostRegister() on this address will fail. Please guard on sq_db != NULL and unregister via channel->qp.reg->addr.ptr (same applies to the cleanup paths).

-        (void)cuMemHostUnregister(self->channels[i].sq_db);
+        if (self->channels[i].sq_db != NULL) {
+            (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+        }
@@
-    (void)cuMemHostUnregister(self->channels[i].sq_db);
+    if (self->channels[i].sq_db != NULL) {
+        (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+    }
@@
-        (void)cuMemHostUnregister(self->channels[i].sq_db);
+        if (self->channels[i].sq_db != NULL) {
+            (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+        }

Also applies to: 199-209

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/uct/ib/mlx5/gdaki/gdaki.h (1)

20-28: Per-channel layout and EP state look consistent

The new num_channels field on uct_rc_gdaki_iface_t and the per-EP uct_rc_gdaki_channel_t *channels pointer align with the multi-channel design in gdaki.c (per-channel CQ/QP arrays, per-channel connection and cleanup). Struct wiring and ownership look correct; lifetime and cleanup are handled in the EP ctor/dtor.

Note that all uses assume num_channels >= 1 (e.g., ep_is_connected dereferences channels[0]); see my comment in gdaki.c suggesting we reject NUM_CHANNELS=0 at iface init time to avoid UB.

Also applies to: 31-35, 37-44

🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki.c (1)

73-85: Layout helper is correct; tiny readability nit in the comma expression

The new uct_rc_gdaki_calc_dev_ep_layout() nicely centralizes the device-EP layout and is used consistently from:

  • Line 136–137: EP ctor to size the DevX umem and WQ offsets.
  • Line 415–416: uct_rc_gdaki_ep_get_device_ep() to recompute dev_ep_size and the header size (qp_attr.umem_offset) for host staging.

The asserts on sizeof(uct_rc_gdaki_dev_ep_t) == 64 and sizeof(uct_rc_gdaki_dev_qp_t) == 128 are a good protection against drift with the CUDA side.

One minor readability nit: Line 81 currently uses a comma expression:

*cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t),
qp_attr->umem_offset = *cq_umem_offset_p +
                       sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;

This is legal C, but non-idiomatic and easy to misread as a typo. Consider splitting into two statements for clarity:

-    *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t),
-    qp_attr->umem_offset = *cq_umem_offset_p +
-                           sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;
+    *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t);
+    qp_attr->umem_offset = *cq_umem_offset_p +
+                           sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;

Functionally it’s the same, but much clearer to future readers.

Also applies to: 126-137, 136-137, 415-416

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a3e0349 and e8aa646.

📒 Files selected for processing (2)
  • src/uct/ib/mlx5/gdaki/gdaki.c (16 hunks)
  • src/uct/ib/mlx5/gdaki/gdaki.h (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/uct/ib/mlx5/gdaki/gdaki.c (5)
src/ucs/debug/memtrack.c (2)
  • ucs_calloc (336-342)
  • ucs_free (368-372)
src/uct/ib/mlx5/dv/ib_mlx5_dv.c (4)
  • uct_ib_mlx5_devx_create_cq_common (558-643)
  • uct_ib_mlx5_devx_create_qp_common (127-272)
  • uct_ib_mlx5_devx_destroy_cq_common (695-699)
  • uct_ib_mlx5_devx_destroy_qp_common (410-413)
src/uct/ib/base/ib_verbs.h (2)
  • uct_ib_pack_uint24 (127-132)
  • uct_ib_unpack_uint24 (134-137)
src/uct/ib/rc/base/rc_iface.c (1)
  • uct_rc_iface_fill_attr (820-833)
src/uct/ib/mlx5/ib_mlx5.c (1)
  • uct_ib_mlx5_wq_calc_sizes (342-346)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (4)
src/uct/ib/mlx5/gdaki/gdaki.c (4)

87-214: Per-channel CQ/QP allocation and cleanup paths look solid

The EP constructor/dtor changes for per-channel resources are internally consistent:

  • Line 157: dbrec.mem_id is set from the DevX umem.
  • Line 159–163: self->channels is allocated with iface->num_channels entries and properly checked for OOM.
  • Line 166–193: For each channel,
    • channel->cq.devx.mem.mem and channel->qp.super.devx.mem.mem share the same DevX umem as self->umem.
    • dbrec.offset is set to qps[i].cq_dbrec / qps[i].qp_dbrec and passed into the common DevX create helpers.
    • cq_attr.umem_offset and qp_attr.umem_offset are advanced per-channel so each CQ/QP uses distinct regions in the shared umem.
  • Error paths:
    • On CQ creation failure, you jump to err_qp and only destroy previously created QPs/CQs.
    • On QP creation failure, you destroy the just-created CQ for that channel (label err_cq), then unwind earlier channels in err_qp, and finally free channels, dereg the umem, and free GPU memory.

The EP destructor at lines 216–228 complements this properly:

  • Iterates over iface->num_channels and, per channel, unregisters BF host memory, then destroys QP and CQ, then frees self->channels and the underlying DevX umem and GPU buffer.

Overall, the lifetime and cleanup of per-channel CQs/QPs and the backing umem look correct and leak-free in both success and failure cases.

Also applies to: 216-228


236-250: Multi-channel EP addressing and connect flow are consistent

The new addressing and connect logic lines up correctly:

  • Line 240–248 (uct_rc_gdaki_ep_get_address):
    • Uses ucs_serialize_next() to walk a raw buffer and uct_ib_pack_uint24() to emit one 24-bit qp_num per channel.
  • Line 359–360 (uct_rc_gdaki_iface_query):
    • Sets ep_addr_len = sizeof(uct_ib_uint24_t) * iface->num_channels, which matches exactly what ep_get_address packs.
  • Line 265–295 (uct_rc_gdaki_ep_connect_to_ep_v2):
    • Mirrors the packing side by repeatedly calling ucs_serialize_next() on ep_addr and uct_ib_unpack_uint24() to recover per-channel destination QP numbers.
    • Connects each local ep->channels[i].qp.super to its corresponding remote QP via uct_rc_mlx5_iface_common_devx_connect_qp() with the same AH/path MTU used previously.

This yields a clean, symmetric mapping from packed qp_num[0..num_channels-1] to per-channel QPs and keeps the address length consistent with the configuration.

No functional issues spotted here.

Also applies to: 259-297, 359-360
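
The symmetry between packing and unpacking described above can be sketched without UCX. The helpers below are hypothetical stand-ins for uct_ib_pack_uint24() / uct_ib_unpack_uint24() and ucs_serialize_next(); the little-endian byte order is an assumption of this sketch, not a statement about the UCX wire format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QPN_BYTES 3 /* one 24-bit QP number per channel */

static void pack_u24(uint8_t *buf, uint32_t val)
{
    assert(val <= 0xFFFFFFu); /* QP numbers must fit in 24 bits */
    buf[0] = (uint8_t)(val & 0xFF);
    buf[1] = (uint8_t)((val >> 8) & 0xFF);
    buf[2] = (uint8_t)((val >> 16) & 0xFF);
}

static uint32_t unpack_u24(const uint8_t *buf)
{
    return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
           ((uint32_t)buf[2] << 16);
}

/* Packs one QP number per channel; returns the address length in bytes. */
static size_t ep_addr_pack(uint8_t *addr, const uint32_t *qp_nums,
                           unsigned num_channels)
{
    unsigned i;

    for (i = 0; i < num_channels; i++) {
        pack_u24(addr + (size_t)i * QPN_BYTES, qp_nums[i]);
    }
    return (size_t)num_channels * QPN_BYTES;
}

/* Recovers the per-channel QP numbers in the same order. */
static void ep_addr_unpack(const uint8_t *addr, uint32_t *qp_nums,
                           unsigned num_channels)
{
    unsigned i;

    for (i = 0; i < num_channels; i++) {
        qp_nums[i] = unpack_u24(addr + (size_t)i * QPN_BYTES);
    }
}
```

The returned length plays the role of ep_addr_len: it grows linearly with num_channels, so both sides must agree on the channel count.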


392-479: Device-EP creation path matches the new layout and fixes the prior OOM-status bug

The multi-channel uct_rc_gdaki_ep_get_device_ep() changes look correct and address the earlier review concern:

  • Lines 412–416: Rebuild qp_attr and call uct_rc_gdaki_calc_dev_ep_layout() with iface->num_channels to get consistent cq_umem_offset, dev_ep_size, and qp_attr.umem_offset (header size).
  • Lines 418–422: Allocate a host staging buffer of size qp_attr.umem_offset (dev_ep), and now correctly set status = UCS_ERR_NO_MEMORY before goto out_ctx when ucs_calloc() fails, so the caller does not see a spurious UCS_OK.
  • Lines 424–429: Zero the whole GPU-side dev_ep region via cuMemsetD8(ep->ep_gpu, dev_ep_size) before populating the header.
  • Lines 430–436: Fill common fields (atomic buffer, lkey, WQE count, FC mask, and sq_wqe_daddr pointing into GPU memory at qp_attr.umem_offset).
  • Lines 437–456: For each channel, register the BF region, obtain the device pointer for the doorbell, and program dev_ep->qps[i].sq_db, sq_num, and clear cq_buff.
  • Lines 458–463: Copy only the header (qp_attr.umem_offset bytes) from host dev_ep to device (ep->ep_gpu), leaving the WQ region as zeroed.
  • Lines 474–476: Error label out_free frees dev_ep in all failure paths after allocation, and the context is popped at out_ctx.

The flow is aligned with the layout helper and per-channel host-side setup in the EP ctor, and the important OOM / error-status semantics are now correct.

I don’t see further correctness issues in this path.


15-16: Includes and config wiring for multi-channel support are coherent

A few smaller but important wiring details look good:

  • Lines 15–16: Adding ucs/type/serialize.h and uct/ib/base/ib_verbs.h is appropriate for ucs_serialize_next() and the 24-bit pack/unpack helpers.
  • Lines 23–27 and 38–42: The new num_channels field in uct_rc_gdaki_iface_config_t and "NUM_CHANNELS" config table entry are consistent, with a sensible default of "1".
  • Lines 380–390: uct_rc_gdaki_create_cq() continues to disable regular CQs (type UCT_IB_MLX5_OBJ_TYPE_NULL), which is compatible with the per-channel DevX CQs created from the EP ctor.
  • Lines 634–635: self->num_channels = config->num_channels cleanly propagates the config into the iface instance and is used consistently across the file.

Apart from the need to reject NUM_CHANNELS=0 called out in my other comment, this plumbing looks correct.

Also applies to: 23-27, 38-42, 380-390, 634-635

ucs_assert(sizeof(uct_rc_gdaki_dev_ep_t) == 64);
ucs_assert(sizeof(uct_rc_gdaki_dev_qp_t) == 128);

*cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t),
Contributor

;

uint32_t *data_ptr = (uint32_t*)&cqe64->wqe_counter;
uint32_t data = READ_ONCE(*data_ptr);
uint64_t rsvd_idx = READ_ONCE(ep->sq_rsvd_index);
uct_rc_gdaki_dev_qp_t *qp = ep->qps + cid;
Contributor

@ofirfarjun7 ofirfarjun7 Nov 24, 2025

Minor: use &ep->qps[cid].
Maybe add a bounds-check assertion on the channel index.
A helper function could do both.
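
One possible shape for such a helper, as a host-side sketch. The types here are hypothetical simplifications: the real uct_rc_gdaki_dev_ep_t uses a flexible qps[0] array, and this sketch assumes the channel count is reachable from the endpoint, which may not hold in the device struct:

```c
#include <assert.h>

#define SKETCH_MAX_CHANNELS 4 /* hypothetical fixed bound for the sketch */

typedef struct {
    unsigned sq_num;
} dev_qp_t;

typedef struct {
    unsigned num_channels;
    dev_qp_t qps[SKETCH_MAX_CHANNELS];
} dev_ep_t;

/* Single access point for per-channel state: the bounds check lives in
 * one place instead of being repeated at every ep->qps + cid site. */
static dev_qp_t *dev_ep_get_qp(dev_ep_t *ep, unsigned cid)
{
    assert(cid < ep->num_channels);
    return &ep->qps[cid];
}
```

In device code the assertion would presumably be compiled out of release builds, so the helper costs nothing on the fast path.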

uint32_t cqe_num;
uint16_t sq_wqe_num;
uint32_t sq_num;
uint8_t pad[12];
Contributor

@ovidiusm ovidiusm Nov 26, 2025

Is this padding correct? I computed total size 124 bytes. Or is there internal padding around the lock?

Contributor

maybe move sq_num after sq_lock, to avoid a "hole"
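
The effect of field order on compiler-inserted padding can be measured directly. A minimal sketch with hypothetical three-field structs (not the real uct_rc_gdaki_dev_qp_t layout), comparing a narrow-wide-narrow order against one that groups the two 32-bit fields:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 32-bit field ahead of a 64-bit one: the compiler may insert a hole
 * after sq_lock and tail padding after sq_num. */
typedef struct {
    uint32_t sq_lock;
    uint64_t sq_ready_index;
    uint32_t sq_num;
} qp_holey_t;

/* Same fields, with sq_num moved right after sq_lock. */
typedef struct {
    uint64_t sq_ready_index;
    uint32_t sq_lock;
    uint32_t sq_num;
} qp_packed_t;

#define QP_FIELD_BYTES (sizeof(uint64_t) + 2 * sizeof(uint32_t))

/* Compile-time guard: the reordered struct must contain no padding. */
static_assert(sizeof(qp_packed_t) == QP_FIELD_BYTES,
              "reordered struct must not contain padding");

static size_t qp_holey_padding(void)
{
    return sizeof(qp_holey_t) - QP_FIELD_BYTES;
}

static size_t qp_packed_padding(void)
{
    return sizeof(qp_packed_t) - QP_FIELD_BYTES;
}
```

On ABIs that align uint64_t to 8 bytes, qp_holey_padding() is 8, which is the kind of gap that makes a hand-computed pad[] size disagree with sizeof.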

params.num_channels = 1;
switch (get_send_mode()) {
case MULTI_CHANNEL:
params.num_channels = 32;
Contributor

need break;

Contributor Author

It was an intentional fall-through; the rest of the params come from NODELAY_WITH_REQ.

Contributor

so pls add comment


 ucs_status_t status = uct_device_ep_put_single<UCS_DEVICE_LEVEL_THREAD>(
-        ep, mem_elem, va, rva, length, UCT_DEVICE_FLAG_NODELAY, &comp);
+        ep, mem_elem, va, rva, length, 0, UCT_DEVICE_FLAG_NODELAY, &comp);
Contributor

maybe use channel_id also in uct tests?

Comment on lines +78 to +79
ucs_assert(sizeof(uct_rc_gdaki_dev_ep_t) == 64);
ucs_assert(sizeof(uct_rc_gdaki_dev_qp_t) == 128);
Contributor

UCS_STATIC_ASSERT

goto err_cq;
}
for (i = 0; i < iface->num_channels; i++) {
channel = self->channels + i;
Contributor

minor: &self->channels[i];

Comment on lines +201 to +202
while (i > 0) {
i--;
Contributor

minor: while (i-- > 0)
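
The suggested idiom, shown in a self-contained unwind path. This is a sketch with hypothetical names (create_channels, channel_created) and a simulated failure point, not the EP constructor itself:

```c
#include <assert.h>

#define MAX_CHANNELS 8

static int channel_created[MAX_CHANNELS];

/* Creates channels 0..num-1. fail_at simulates a creation failure at that
 * index (pass a value >= num for no failure). Returns 0 on success, -1 on
 * failure after releasing everything created so far. */
static int create_channels(unsigned num, unsigned fail_at)
{
    unsigned i;

    assert(num <= MAX_CHANNELS);

    for (i = 0; i < num; i++) {
        if (i == fail_at) {
            goto err_unwind;
        }
        channel_created[i] = 1;
    }
    return 0;

err_unwind:
    /* i is the index that failed; release i-1 .. 0 in reverse order.
     * After the loop the unsigned i has wrapped, so it is not reused. */
    while (i-- > 0) {
        channel_created[i] = 0;
    }
    return -1;
}
```

The post-decrement form tests the old value and then steps down, so it is safe with an unsigned counter and never touches the entry that failed.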

unsigned i;

for (i = 0; i < iface->num_channels; i++) {
(void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
Contributor

do we need to check flag that it was registered/initialized?

Contributor Author

The page with the UAR may or may not be registered already, so currently we just ignore errors. This may cause a use-after-free if we release a page that is still used by another EP. We need some tracking. WDYT?

Contributor

can we check dev_ep_init flag?

pi = uct_rc_mlx5_gda_parse_cqe(ep, cid, &wqe_cnt, &opcode);

-if (pi < comp->wqe_idx) {
+if ((int64_t)pi < (int64_t)comp->wqe_idx) {
Contributor

so we expect 64bit wraparound? why need to cast?

Contributor Author

for first message wqe_idx will be 0 and initial pi is -1

Contributor

ok, maybe worth adding comment
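
Why the cast matters for the first message can be shown on the host. A minimal sketch, assuming the initial pi is stored as all-ones in a uint64_t (the author's "-1") and that a true result means the WQE is still pending; both function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* The sketch relies on the usual two's-complement conversion. */
static_assert((int64_t)UINT64_MAX == -1, "two's-complement conversion");

/* Old comparison: an initial pi of UINT64_MAX is never below wqe_idx,
 * so the first WQE (index 0) looks completed before any CQE arrives. */
static int wqe_pending_unsigned(uint64_t pi, uint64_t wqe_idx)
{
    return pi < wqe_idx;
}

/* New comparison: the initial pi reads as -1, which is below index 0. */
static int wqe_pending_signed(uint64_t pi, uint64_t wqe_idx)
{
    return (int64_t)pi < (int64_t)wqe_idx;
}
```

A short comment next to the cast in gdaki.cuh stating this initial-value case would answer the reviewer's question in place.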


typedef struct {
uint64_t wqe_idx;
unsigned channel_id;
Contributor

maybe we add channel_id to ucp_device_progress_req (next pr because it can break api)?

for (i = 0; i < iface->num_channels; i++) {
channel = ep->channels + i;

(void)cuMemHostRegister(channel->qp.reg->addr.ptr,
Contributor

Don't we need to unregister in case of an error?

unsigned i;

-uct_ib_pack_uint24(rc_addr->qp_num, ep->qp.super.qp_num);
+for (i = 0; i < iface->num_channels; i++) {
Contributor

So we assume all peers use same #channels?
Maybe we should add #channels to the address to validate it is equal?
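
One way the suggested validation could look, as a sketch. The one-byte count prefix is a hypothetical layout invented for illustration; the PR's actual address carries only the packed QP numbers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QPN_BYTES 3 /* one 24-bit QP number per channel */

/* Hypothetical layout: [num_channels: 1 byte][qp_num: 3 bytes] x count */
static size_t ep_addr_len(unsigned num_channels)
{
    assert((num_channels >= 1) && (num_channels <= UINT8_MAX));
    return 1 + (size_t)num_channels * QPN_BYTES;
}

/* Returns 0 if the peer advertises the same channel count as the local
 * side and the address length matches that count, -1 otherwise. */
static int ep_addr_check(const uint8_t *addr, size_t addr_len,
                         unsigned local_num_channels)
{
    if ((addr_len < 1) || (local_num_channels == 0) ||
        (local_num_channels > UINT8_MAX)) {
        return -1;
    }

    if (addr[0] != local_num_channels) {
        return -1; /* peers disagree on the number of channels */
    }

    return (addr_len == ep_addr_len(local_num_channels)) ? 0 : -1;
}
```

Running the check before the connect loop would turn a silent mismatch into an explicit error instead of reading past the peer's address.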

&iface->super, &ep->channels[i].qp.super, dest_qp_num, &ah_attr,
path_mtu, path_index, iface->super.super.config.max_rd_atomic);
if (status != UCS_OK) {
return status;
Contributor

QPs connected before the error remain connected; is that a problem?

ucs_offsetof(uct_rc_gdaki_iface_config_t, mlx5),
UCS_CONFIG_TYPE_TABLE(uct_rc_mlx5_common_config_table)},

{"NUM_CHANNELS", "1",
Contributor

Maybe we need to limit max val?
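
A range check at iface init could cover both the upper bound raised here and the NUM_CHANNELS=0 case raised earlier in the review. A minimal sketch; the bound of 64 and the function name are hypothetical, not values from the PR:

```c
#include <assert.h>

#define SKETCH_MAX_CHANNELS 64u /* hypothetical upper bound */

static_assert(SKETCH_MAX_CHANNELS >= 1, "at least one channel is required");

/* Returns 0 when the configured value is usable, -1 otherwise.
 * Zero is rejected because the code dereferences channels[0]. */
static int validate_num_channels(unsigned num_channels)
{
    if ((num_channels == 0) || (num_channels > SKETCH_MAX_CHANNELS)) {
        return -1;
    }
    return 0;
}
```

A real limit would presumably be derived from device resources (QPs, CQs, umem size) rather than a fixed constant.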

uint32_t sq_num;
uint16_t sq_fc_mask;

uint8_t pad[24];
Contributor

@ofirfarjun7 ofirfarjun7 Nov 28, 2025

Can we use __attribute__((aligned(X))) or alignas instead of manual padding?
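
A sketch of the alignas variant in C11. The struct and its field names are hypothetical, loosely modeled on the header fields mentioned in the review (atomic buffer, lkey, WQE count, FC mask), and are not the real uct_rc_gdaki_dev_ep_t:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Aligning the first member to 64 bytes raises the alignment of the whole
 * struct, and sizeof is always a multiple of the alignment, so the 16
 * bytes of fields are rounded up to 64 without a pad[] member. */
typedef struct {
    alignas(64) uint64_t atomic_va;
    uint32_t             atomic_lkey;
    uint16_t             sq_wqe_num;
    uint16_t             sq_fc_mask;
} dev_ep_hdr_t;

/* The size check is still worth keeping: it documents the layout contract
 * shared between the host code and the CUDA side. */
static_assert(sizeof(dev_ep_hdr_t) == 64, "device EP header must be 64 bytes");
```

The trade-off is that the tail padding becomes implicit, so a compile-time size assertion remains the only visible statement of the intended layout.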
