-
Notifications
You must be signed in to change notification settings - Fork 497
UCT/GDA: channel_id implementation #11013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
UCT/GDA: channel_id implementation #11013
Conversation
WalkthroughAdds multi-channel support to the MLX5 GDA path: refactors per-ep and device-ep layouts to per-channel QP/CQ blocks, threads channel_id through public device APIs and internal helpers, removes channel_id from the UCP request struct, and updates CUDA test kernels to route per-thread operations to channels. Changes
Sequence Diagram(s)sequenceDiagram
participant Kernel as CUDA Kernel
participant API as UCP/UCT API
participant GDA as GDA Core
participant QP as Per-Channel QP
Note over Kernel,GDA: Before (single-channel)
Kernel->>API: put_single(addr,rkey,data)
API->>GDA: route to ep (no cid)
GDA->>QP: access ep->qp (global)
Note over Kernel,GDA: After (multi-channel)
Kernel->>Kernel: compute channel_id
Kernel->>API: put_single(addr,rkey,data,channel_id)
API->>GDA: invoke with channel context
GDA->>QP: access ep->qps[channel_id]
QP->>QP: per-channel WQE/CQ/DBR operations
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
294-338: Fix unsafe use of completion pointer when it may be nullptrIn
uct_rc_mlx5_gda_ep_single,uct_rc_mlx5_gda_ep_put_multi, anduct_rc_mlx5_gda_ep_put_multi_partial, the code unconditionally does:
uct_rc_gda_completion_t *comp = &tl_comp->rc_gda;and later checks
if (comp != nullptr).However, for the public UCP device APIs it is explicitly valid to call with
req == nullptr, which leads tocomp == nullptrbeing passed down to these UCT entry points (viaucp_device_request_initandUCP_DEVICE_SEND_BLOCKING). Whentl_compisnullptr, taking&tl_comp->rc_gdais undefined and will cause device-side memory corruption. This breaks the documented “no-request / no-completion” fast path. Based on learnings.You already have logic that handles the “no completion object, rely on FC only” case using
comp == nullptr. The only missing piece is guarding the initial derivation ofcomp. Suggested fix:@@ template<ucs_device_level_t level> UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_single( uct_rc_gdaki_dev_ep_t *ep, const uct_device_mem_element_t *tl_mem_elem, const void *address, uint32_t lkey, uint64_t remote_address, uint32_t rkey, size_t length, unsigned cid, uint64_t flags, uct_device_completion_t *tl_comp, uint32_t opcode, bool is_atomic, uint64_t add) { - uct_rc_gda_completion_t *comp = &tl_comp->rc_gda; + uct_rc_gda_completion_t *comp = nullptr; + if (tl_comp != nullptr) { + comp = &tl_comp->rc_gda; + } @@ template<ucs_device_level_t level> UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_put_multi( uct_device_ep_h tl_ep, const uct_device_mem_element_t *tl_mem_list, @@ - uct_rc_gda_completion_t *comp = &tl_comp->rc_gda; + uct_rc_gda_completion_t *comp = nullptr; + if (tl_comp != nullptr) { + comp = &tl_comp->rc_gda; + } @@ template<ucs_device_level_t level> UCS_F_DEVICE ucs_status_t uct_rc_mlx5_gda_ep_put_multi_partial( uct_device_ep_h tl_ep, const uct_device_mem_element_t *tl_mem_list, @@ - uct_rc_gda_completion_t *comp = &tl_comp->rc_gda; + uct_rc_gda_completion_t *comp = nullptr; + if (tl_comp != nullptr) { + comp = &tl_comp->rc_gda; + }The existing
if (comp != nullptr)guards in these functions will then work as intended for both “with request” and “no request” cases.Also applies to: 340-372, 374-463, 465-558
test/gtest/ucp/cuda/test_kernels.h (1)
22-66: First init_params factory method fails to initialize num_channels, causing modulo-by-zero in kernelThe struct addition is incompletely integrated. The first
init_params()at line 366 intest/gtest/ucp/test_ucp_device.ccuses zero-initialization (params = {}), which leavesnum_channelsat 0. When the kernel code executeschannel_id = threadIdx.x % params.num_channels;(line 23, 26 intest_kernels.cu), this causes modulo-by-zero—undefined behavior or crash.The second
init_params()at line 468 correctly setsnum_channels = 1(then 32 for multi-channel), but the first one does not. This breaks all test methods that call the firstinit_params().Fix: Add
params.num_channels = 1;after line 371 in the firstinit_params()method.
🧹 Nitpick comments (4)
test/gtest/ucp/test_ucp_device.cc (2)
439-457: MULTI_CHANNEL variant wiring and init look sound; consider centralizing channel count constantThe new MULTI_CHANNEL send mode is correctly threaded through
get_test_variants(),get_send_mode(), and theinit()override; settingUCX_RC_GDA_NUM_CHANNELSbeforetest_ucp_device::init()ensures the transport sees the config when the context/EPs are created.To avoid future drift, consider defining a single constant for the multi‑channel count (e.g.
static const unsigned MULTI_CHANNEL_COUNT = 32;) and using it both for the env var and forparams.num_channelsininit_params(). This keeps tests consistent if the desired channel count ever changes.Also applies to: 461-466
468-480: Clarify MULTI_CHANNEL switch behavior; avoid implicit fallthrough ambiguityIn
init_params()theMULTI_CHANNELcase setsparams.num_channels = 32;and then falls through toNODELAY_WITH_REQ(nobreak;), so MULTI_CHANNEL currently behaves as “NODELAY_WITH_REQ + multi‑channel”.If that coupling is intentional, consider making it explicit to avoid ambiguity and potential
-Wimplicit-fallthroughwarnings:- params.num_channels = 1; + params.num_channels = 1; switch (get_send_mode()) { - case MULTI_CHANNEL: - params.num_channels = 32; - case NODELAY_WITH_REQ: - params.with_no_delay = true; - params.with_request = true; - break; + case MULTI_CHANNEL: + params.num_channels = 32; + params.with_no_delay = true; + params.with_request = true; + break; + case NODELAY_WITH_REQ: + params.with_no_delay = true; + params.with_request = true; + break;Alternatively, if you prefer relying on fallthrough, adding an explicit
/* fallthrough */(or the project’s fallthrough macro) afterparams.num_channels = 32;would still document the intent and keep compilers quiet.src/uct/api/device/uct_device_impl.h (1)
37-71: UCT single/atomic device APIs: channel_id integration looks correctThe new
channel_idparameter is added in a consistent position (beforeflags) and correctly forwarded only to the RC_MLX5_GDA backend; CUDA IPC remains unchanged and simply ignores the channel. This keeps the API uniform without breaking existing CUDA behavior.You may want to explicitly document that transports other than RC_MLX5_GDA currently ignore
channel_idso callers don’t over-interpret it.Also applies to: 86-117
test/gtest/uct/cuda/test_kernels_uct.cu (1)
100-113: UCT CUDA kernel tests updated consistently for channel_idAll test kernels now pass an explicit
channel_idargument (0) in the correct position foruct_device_ep_put_single,uct_device_ep_atomic_add,uct_device_ep_put_multi, anduct_device_ep_put_multi_partial. This keeps the tests aligned with the new API without changing their semantics (still single-channel).Once multi-channel support is more mature, consider extending these tests to exercise non-zero
channel_idvalues as well.Also applies to: 169-181, 225-243, 308-325
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
src/ucp/api/device/ucp_device_impl.h(4 hunks)src/uct/api/device/uct_device_impl.h(8 hunks)src/uct/ib/mlx5/gdaki/gdaki.c(16 hunks)src/uct/ib/mlx5/gdaki/gdaki.cuh(21 hunks)src/uct/ib/mlx5/gdaki/gdaki.h(1 hunks)src/uct/ib/mlx5/gdaki/gdaki_dev.h(2 hunks)test/gtest/ucp/cuda/test_kernels.cu(3 hunks)test/gtest/ucp/cuda/test_kernels.h(1 hunks)test/gtest/ucp/test_ucp_device.cc(2 hunks)test/gtest/uct/cuda/test_kernels.cu(4 hunks)test/gtest/uct/cuda/test_kernels_uct.cu(4 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-06T09:04:19.215Z
Learnt from: iyastreb
Repo: openucx/ucx PR: 10906
File: src/tools/perf/cuda/ucp_cuda_kernel.cu:70-91
Timestamp: 2025-11-06T09:04:19.215Z
Learning: In UCX device API (src/ucp/api/device/ucp_device_impl.h), nullptr is a valid and supported value for the ucp_device_request_t* parameter in functions like ucp_device_put_single, ucp_device_put_multi, etc. This is an intentional performance optimization where operations are posted without per-request tracking overhead. The API explicitly handles nullptr in ucp_device_request_init and UCP_DEVICE_SEND_BLOCKING macro.
Applied to files:
src/ucp/api/device/ucp_device_impl.htest/gtest/ucp/cuda/test_kernels.cu
🧬 Code graph analysis (4)
test/gtest/uct/cuda/test_kernels_uct.cu (1)
src/uct/api/device/uct_device_impl.h (2)
uct_device_ep_put_single(54-71)uct_device_ep_atomic_add(102-117)
test/gtest/ucp/test_ucp_device.cc (1)
test/gtest/ucp/ucp_test.cc (6)
add_variant_values(488-499)add_variant_values(488-490)add_variant_values(501-509)add_variant_values(501-503)init(97-104)init(97-97)
test/gtest/ucp/cuda/test_kernels.cu (2)
test/gtest/ucp/test_ucp_device.cc (6)
params(68-68)params(388-394)params(389-389)params(396-409)params(396-398)params(468-496)src/ucp/api/device/ucp_device_impl.h (2)
ucp_device_put_single(142-165)ucp_device_put_multi(263-289)
src/uct/ib/mlx5/gdaki/gdaki.c (6)
src/ucs/debug/memtrack.c (2)
ucs_calloc(336-342)ucs_free(368-372)src/uct/ib/mlx5/dv/ib_mlx5_dv.c (2)
uct_ib_mlx5_devx_create_cq_common(558-643)uct_ib_mlx5_devx_create_qp_common(127-272)src/uct/ib/base/ib_verbs.h (2)
uct_ib_pack_uint24(127-132)uct_ib_unpack_uint24(134-137)src/uct/ib/mlx5/rc/rc_mlx5_devx.c (1)
uct_rc_mlx5_iface_common_devx_connect_qp(384-517)src/uct/ib/rc/base/rc_iface.c (1)
uct_rc_iface_fill_attr(820-833)src/uct/ib/mlx5/ib_mlx5.c (1)
uct_ib_mlx5_wq_calc_sizes(342-346)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (4)
src/ucp/api/device/ucp_device_impl.h (2)
262-289: Multi-element UCP device ops pass channel_id cleanly to UCT
ucp_device_put_multianducp_device_put_multi_partialnow passchannel_idin front offlagstouct_device_ep_put_multi/_put_multi_partial, matching the new UCT API contract. The existing mem-list handling andreq/compinitialization patterns are unchanged and still compatible withreq == nullptr.Also applies to: 345-376
141-165: Channel-aware UCP device single/atomic ops are wired correctly into UCT API—verification completeThe new
channel_idparameter is consistently threaded fromucp_device_put_single/ucp_device_counter_incintouct_device_ep_put_single/uct_device_ep_atomic_addwith the expected argument ordering (position 6 forput_single, position 5 foratomic_add), while preserving the existingreq == nullptrfast-path behavior viaucp_device_request_initandUCP_DEVICE_SEND_BLOCKING.All four test call sites (test/gtest/uct/cuda/test_kernels.cu:22, :57 and test/gtest/uct/cuda/test_kernels_uct.cu:110, :178) have been verified to use the correct parameter order and pass valid channel_id values. No functional issues spotted.
src/uct/api/device/uct_device_impl.h (1)
143-188: UCT multi/multi_partial APIs correctly propagate channel_id into MLX5 GDA pathFor
uct_device_ep_put_multianduct_device_ep_put_multi_partial, the newchannel_idis passed through to the GDA implementations while CUDA IPC continues to use onlyflags/comp. Argument ordering is consistent with UCP and the tests. No functional issues seen.Also applies to: 220-269
src/uct/ib/mlx5/gdaki/gdaki.cuh (1)
20-28: Per-channel QP/CQ handling and WQE layout look coherentThe changes to use
ep->qps[cid]for SQ/CQ/DBR state (sq_db,sq_num,sq_rsvd_index,sq_ready_index,sq_lock,cq_buff,qp_dbrec) plus the updateduct_rc_mlx5_gda_get_wqe_ptrand CQ parsing code cleanly separate per-channel state:
- WQE addresses are computed as
cid * sq_wqe_num + (wqe_idx & (sq_wqe_num - 1)), matching a contiguous layout of per-channel SQs.- CQ parsing and max-allocation logic now operate on per-channel CQs and reserved indices.
- Doorbell/DBR updates and debug dumps use the correct per-channel QP and CQ buffers.
- Completion checking uses
comp->channel_idto select the correct QP for CQE parsing and error reporting.Within those assumptions (identical
sq_wqe_numacross channels, validcidindices), the multi-channel plumbing looks consistent.Also applies to: 30-56, 98-138, 140-203, 204-241, 260-285, 287-292, 560-575
| for (i = 0; i < iface->num_channels; i++) { | ||
| (void)cuMemHostUnregister(self->channels[i].sq_db); | ||
| uct_ib_mlx5_devx_destroy_qp_common(&self->channels[i].qp.super); | ||
| uct_ib_mlx5_devx_destroy_cq_common(&self->channels[i].cq); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the host pointer when unregistering doorbells
channel->sq_db holds the device pointer returned by cuMemHostGetDevicePointer(), but cuMemHostUnregister() expects the original host pointer. Passing the device pointer silently fails (CUDA_ERROR_INVALID_VALUE), so the doorbell stays registered and the next cuMemHostRegister() on this address will fail. Please guard on sq_db != NULL and unregister via channel->qp.reg->addr.ptr (same applies to the cleanup paths).
- (void)cuMemHostUnregister(self->channels[i].sq_db);
+ if (self->channels[i].sq_db != NULL) {
+ (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+ }
@@
- (void)cuMemHostUnregister(self->channels[i].sq_db);
+ if (self->channels[i].sq_db != NULL) {
+ (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+ }
@@
- (void)cuMemHostUnregister(self->channels[i].sq_db);
+ if (self->channels[i].sq_db != NULL) {
+ (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr);
+ }Also applies to: 199-209
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/uct/ib/mlx5/gdaki/gdaki.h (1)
20-28: Per-channel layout and EP state look consistentThe new
num_channelsfield onuct_rc_gdaki_iface_tand the per-EPuct_rc_gdaki_channel_t *channelspointer align with the multi-channel design ingdaki.c(per-channel CQ/QP arrays, per-channel connection and cleanup). Struct wiring and ownership look correct; lifetime and cleanup are handled in the EP ctor/dtor.Note that all uses assume
num_channels >= 1(e.g.,ep_is_connecteddereferenceschannels[0]); see my comment ingdaki.csuggesting we rejectNUM_CHANNELS=0at iface init time to avoid UB.Also applies to: 31-35, 37-44
🧹 Nitpick comments (1)
src/uct/ib/mlx5/gdaki/gdaki.c (1)
73-85: Layout helper is correct; tiny readability nit in the comma expressionThe new
uct_rc_gdaki_calc_dev_ep_layout()nicely centralizes the device-EP layout and is used consistently from:
- Line 136–137: EP ctor to size the DevX umem and WQ offsets.
- Line 415–416:
uct_rc_gdaki_ep_get_device_ep()to recomputedev_ep_sizeand the header size (qp_attr.umem_offset) for host staging.The asserts on
sizeof(uct_rc_gdaki_dev_ep_t) == 64andsizeof(uct_rc_gdaki_dev_qp_t) == 128are a good protection against drift with the CUDA side.One minor readability nit: Line 81 currently uses a comma expression:
*cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t), qp_attr->umem_offset = *cq_umem_offset_p + sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;This is legal C, but non-idiomatic and easy to misread as a typo. Consider splitting into two statements for clarity:
- *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t), - qp_attr->umem_offset = *cq_umem_offset_p + - sizeof(uct_rc_gdaki_dev_qp_t) * num_channels; + *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t); + qp_attr->umem_offset = *cq_umem_offset_p + + sizeof(uct_rc_gdaki_dev_qp_t) * num_channels;Functionally it’s the same, but much clearer to future readers.
Also applies to: 126-137, 136-137, 415-416
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/uct/ib/mlx5/gdaki/gdaki.c(16 hunks)src/uct/ib/mlx5/gdaki/gdaki.h(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/uct/ib/mlx5/gdaki/gdaki.c (5)
src/ucs/debug/memtrack.c (2)
ucs_calloc(336-342)ucs_free(368-372)src/uct/ib/mlx5/dv/ib_mlx5_dv.c (4)
uct_ib_mlx5_devx_create_cq_common(558-643)uct_ib_mlx5_devx_create_qp_common(127-272)uct_ib_mlx5_devx_destroy_cq_common(695-699)uct_ib_mlx5_devx_destroy_qp_common(410-413)src/uct/ib/base/ib_verbs.h (2)
uct_ib_pack_uint24(127-132)uct_ib_unpack_uint24(134-137)src/uct/ib/rc/base/rc_iface.c (1)
uct_rc_iface_fill_attr(820-833)src/uct/ib/mlx5/ib_mlx5.c (1)
uct_ib_mlx5_wq_calc_sizes(342-346)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: UCX PR (Static_check Static checks)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (4)
src/uct/ib/mlx5/gdaki/gdaki.c (4)
87-214: Per-channel CQ/QP allocation and cleanup paths look solidThe EP constructor/dtor changes for per-channel resources are internally consistent:
- Line 157:
dbrec.mem_idis set from the DevX umem.- Line 159–163:
self->channelsis allocated withiface->num_channelsentries and properly checked for OOM.- Line 166–193: For each channel,
channel->cq.devx.mem.memandchannel->qp.super.devx.mem.memshare the same DevX umem asself->umem.dbrec.offsetis set toqps[i].cq_dbrec/qps[i].qp_dbrecand passed into the common DevX create helpers.cq_attr.umem_offsetandqp_attr.umem_offsetare advanced per-channel so each CQ/QP uses distinct regions in the shared umem.- Error paths:
- On CQ creation failure, you jump to
err_qpand only destroy previously created QPs/CQs.- On QP creation failure, you destroy the just-created CQ for that channel (label
err_cq), then unwind earlier channels inerr_qp, and finally freechannels, dereg the umem, and free GPU memory.The EP destructor at lines 216–228 complements this properly:
- Iterates over
iface->num_channelsand, per channel, unregisters BF host memory, then destroys QP and CQ, then freesself->channelsand the underlying DevX umem and GPU buffer.Overall, the lifetime and cleanup of per-channel CQs/QPs and the backing umem look correct and leak-free in both success and failure cases.
Also applies to: 216-228
236-250: Multi-channel EP addressing and connect flow are consistentThe new addressing and connect logic lines up correctly:
- Line 240–248 (
uct_rc_gdaki_ep_get_address):
- Uses
ucs_serialize_next()to walk a raw buffer anduct_ib_pack_uint24()to emit one 24-bitqp_numper channel.- Line 359–360 (
uct_rc_gdaki_iface_query):
- Sets
ep_addr_len = sizeof(uct_ib_uint24_t) * iface->num_channels, which matches exactly whatep_get_addresspacks.- Line 265–295 (
uct_rc_gdaki_ep_connect_to_ep_v2):
- Mirrors the packing side by repeatedly calling
ucs_serialize_next()onep_addranduct_ib_unpack_uint24()to recover per-channel destination QP numbers.- Connects each local
ep->channels[i].qp.superto its corresponding remote QP viauct_rc_mlx5_iface_common_devx_connect_qp()with the same AH/path MTU used previously.This yields a clean, symmetric mapping from packed
qp_num[0..num_channels-1]to per-channel QPs and keeps the address length consistent with the configuration.No functional issues spotted here.
Also applies to: 259-297, 359-360
392-479: Device-EP creation path matches the new layout and fixes the prior OOM-status bugThe multi-channel
uct_rc_gdaki_ep_get_device_ep()changes look correct and address the earlier review concern:
- Lines 412–416: Rebuild
qp_attrand calluct_rc_gdaki_calc_dev_ep_layout()withiface->num_channelsto get consistentcq_umem_offset,dev_ep_size, andqp_attr.umem_offset(header size).- Lines 418–422: Allocate a host staging buffer of size
qp_attr.umem_offset(dev_ep), and now correctly setstatus = UCS_ERR_NO_MEMORYbeforegoto out_ctxwhenucs_calloc()fails, so the caller does not see a spuriousUCS_OK.- Lines 424–429: Zero the whole GPU-side dev_ep region via
cuMemsetD8(ep->ep_gpu, dev_ep_size)before populating the header.- Lines 430–436: Fill common fields (atomic buffer, lkey, WQE count, FC mask, and
sq_wqe_daddrpointing into GPU memory atqp_attr.umem_offset).- Lines 437–456: For each channel, register the BF region, obtain the device pointer for the doorbell, and program
dev_ep->qps[i].sq_db,sq_num, and clearcq_buff.- Lines 458–463: Copy only the header (
qp_attr.umem_offsetbytes) from hostdev_epto device (ep->ep_gpu), leaving the WQ region as zeroed.- Lines 474–476: Error label
out_freefreesdev_epin all failure paths after allocation, and the context is popped atout_ctx.The flow is aligned with the layout helper and per-channel host-side setup in the EP ctor, and the important OOM / error-status semantics are now correct.
I don’t see further correctness issues in this path.
15-16: Includes and config wiring for multi-channel support are coherentA few smaller but important wiring details look good:
- Lines 15–16: Adding
ucs/type/serialize.handuct/ib/base/ib_verbs.his appropriate forucs_serialize_next()and the 24-bit pack/unpack helpers.- Lines 23–27 and 38–42: The new
num_channelsfield inuct_rc_gdaki_iface_config_tand"NUM_CHANNELS"config table entry are consistent, with a sensible default of"1".- Lines 380–390:
uct_rc_gdaki_create_cq()continues to disable regular CQs (typeUCT_IB_MLX5_OBJ_TYPE_NULL), which is compatible with the per-channel DevX CQs created from the EP ctor.- Lines 634–635:
self->num_channels = config->num_channelscleanly propagates the config into the iface instance and is used consistently across the file.Apart from the need to reject
NUM_CHANNELS=0called out in my other comment, this plumbing looks correct.Also applies to: 23-27, 38-42, 380-390, 634-635
src/uct/ib/mlx5/gdaki/gdaki.c
Outdated
| ucs_assert(sizeof(uct_rc_gdaki_dev_ep_t) == 64); | ||
| ucs_assert(sizeof(uct_rc_gdaki_dev_qp_t) == 128); | ||
|
|
||
| *cq_umem_offset_p = sizeof(uct_rc_gdaki_dev_ep_t), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
;
src/uct/ib/mlx5/gdaki/gdaki.cuh
Outdated
| uint32_t *data_ptr = (uint32_t*)&cqe64->wqe_counter; | ||
| uint32_t data = READ_ONCE(*data_ptr); | ||
| uint64_t rsvd_idx = READ_ONCE(ep->sq_rsvd_index); | ||
| uct_rc_gdaki_dev_qp_t *qp = ep->qps + cid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: qps[cid]
Maybe add channel bounds check assertion.
*can use helper func.
src/uct/ib/mlx5/gdaki/gdaki_dev.h
Outdated
| uint32_t cqe_num; | ||
| uint16_t sq_wqe_num; | ||
| uint32_t sq_num; | ||
| uint8_t pad[12]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this padding correct? I computed total size 124 bytes. Or is there internal padding around the lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move sq_num after sq_lock, to avoid a "hole"
75c5a3c to
eab5668
Compare
| params.num_channels = 1; | ||
| switch (get_send_mode()) { | ||
| case MULTI_CHANNEL: | ||
| params.num_channels = 32; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need break;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was intential fall-through, rest params from NODELAY_WITH_REQ
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so pls add comment
|
|
||
| ucs_status_t status = uct_device_ep_put_single<UCS_DEVICE_LEVEL_THREAD>( | ||
| ep, mem_elem, va, rva, length, UCT_DEVICE_FLAG_NODELAY, &comp); | ||
| ep, mem_elem, va, rva, length, 0, UCT_DEVICE_FLAG_NODELAY, &comp); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use channel_id also in uct tests?
| ucs_assert(sizeof(uct_rc_gdaki_dev_ep_t) == 64); | ||
| ucs_assert(sizeof(uct_rc_gdaki_dev_qp_t) == 128); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UCS_STATIC_ASSERT
| goto err_cq; | ||
| } | ||
| for (i = 0; i < iface->num_channels; i++) { | ||
| channel = self->channels + i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: &self->channels[i];
| while (i > 0) { | ||
| i--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: while (i-- > 0)
| unsigned i; | ||
|
|
||
| for (i = 0; i < iface->num_channels; i++) { | ||
| (void)cuMemHostUnregister(self->channels[i].qp.reg->addr.ptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to check flag that it was registered/initialized?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
page with UAR might be or might be not registered already so currently we just ignore errors. this may cause use-after-free if we release page which is used by another EP. need some tracking. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we check dev_ep_init flag?
| pi = uct_rc_mlx5_gda_parse_cqe(ep, cid, &wqe_cnt, &opcode); | ||
|
|
||
| if (pi < comp->wqe_idx) { | ||
| if ((int64_t)pi < (int64_t)comp->wqe_idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we expect 64bit wraparound? why need to cast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for first message wqe_idx will be 0 and initial pi is -1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, maybe worth adding comment
|
|
||
| typedef struct { | ||
| uint64_t wqe_idx; | ||
| unsigned channel_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we add channel_id to ucp_device_progress_req (next pr because it can break api)?
| for (i = 0; i < iface->num_channels; i++) { | ||
| channel = ep->channels + i; | ||
|
|
||
| (void)cuMemHostRegister(channel->qp.reg->addr.ptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to unregister in case of an error?
| unsigned i; | ||
|
|
||
| uct_ib_pack_uint24(rc_addr->qp_num, ep->qp.super.qp_num); | ||
| for (i = 0; i < iface->num_channels; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we assume all peers use same #channels?
Maybe we should add #channels to the address to validate it is equal?
| &iface->super, &ep->channels[i].qp.super, dest_qp_num, &ah_attr, | ||
| path_mtu, path_index, iface->super.super.config.max_rd_atomic); | ||
| if (status != UCS_OK) { | ||
| return status; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previous qps before the error remains connected, is it a problem?
| ucs_offsetof(uct_rc_gdaki_iface_config_t, mlx5), | ||
| UCS_CONFIG_TYPE_TABLE(uct_rc_mlx5_common_config_table)}, | ||
|
|
||
| {"NUM_CHANNELS", "1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we need to limit max val?
| uint32_t sq_num; | ||
| uint16_t sq_fc_mask; | ||
|
|
||
| uint8_t pad[24]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use __attribute__((aligned(X)) or alignas instead of manual padding?
Summary by CodeRabbit