Skip to content

Conversation

@shasson5
Copy link
Contributor

@shasson5 shasson5 commented Oct 22, 2025

What?

Add process namespace to cuda_ipc rkey

Why?

Support running multiple containers where same process id can be used in different containers

Summary by CodeRabbit

  • New Features

    • Added process namespace support to CUDA IPC memory operations for improved isolation and correctness in containerized environments.
  • Refactor

    • Updated internal caching and memory mapping infrastructure to track and propagate namespace context across IPC operations.
    • Extended data structures and function signatures to accommodate namespace information in memory key handling.

@shasson5 shasson5 added the WIP-DNM Work in progress / Do not review label Oct 22, 2025
@gleon99 gleon99 requested a review from amastbaum November 2, 2025 09:17
Comment on lines +296 to +301
if (!ucs_sys_ns_is_default(UCS_SYS_NS_TYPE_PID)) {
ext_rkey = (uct_cuda_ipc_extended_rkey_t*)packed;
ext_rkey->pid_ns = memh->pid_ns;

ucs_assert(!(getpid() & UCT_CUDA_IPC_RKEY_FLAG_PID_NS));
packed->pid |= UCT_CUDA_IPC_RKEY_FLAG_PID_NS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shasson5 IIUC if a new version sends to an older version, the receiver is going to use an incorrect pid, right?
Is it intended, or we assume there is no such scenario?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's intended but it shouldn't cause a problem IMO

Comment on lines 28 to 29
/* Indicates whether PID NS is contained in rkey */
#define UCT_CUDA_IPC_RKEY_FLAG_PID_NS UCS_BIT(31)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd elaborate more about the way it affects the wire compatibility (or doesn't affect, and why)

amastbaum
amastbaum previously approved these changes Nov 3, 2025
@brminich brminich requested a review from rakhmets November 3, 2025 15:01
@shasson5 shasson5 removed the WIP-DNM Work in progress / Do not review label Nov 12, 2025
@shasson5 shasson5 requested review from gleon99 and yosefe November 12, 2025 12:51
@coderabbitai
Copy link

coderabbitai bot commented Nov 12, 2025

Walkthrough

The pull request extends CUDA IPC cache and memory management systems with process namespace (pid_ns) awareness. Changes include adding pid_ns fields to cache keys and memory handles, introducing an extended remote key type for backward compatibility, and updating all related function signatures and access paths to propagate and utilize namespace information.

Changes

Cohort / File(s) Summary
Cache and hash infrastructure
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c, src/uct/cuda/cuda_ipc/cuda_ipc_cache.h
Added pid_ns to cache key struct; updated hash function and equality comparisons to include pid_ns. Extended uct_cuda_ipc_get_remote_cache() and uct_cuda_ipc_unmap_memhandle() signatures to accept and propagate pid_ns through all remote cache lookup and insertion paths.
Remote key type definitions
src/uct/cuda/cuda_ipc/cuda_ipc_md.h
Introduced new uct_cuda_ipc_extended_rkey_t typedef containing base rkey and pid_ns field. Updated uct_cuda_ipc_unpacked_rkey_t to use extended rkey as super member. Added pid_ns field to uct_cuda_ipc_memh_t.
Memory domain implementation
src/uct/cuda/cuda_ipc/cuda_ipc_md.c
Added pid_ns support to rkey packing/unpacking with UCT_CUDA_IPC_RKEY_FLAG_PID_NS flag. Updated accessibility checks to use nested structure paths. Extended mem registration to capture pid_ns for memh structures when non-default. Maintains wire compatibility for default namespace.
Event descriptor
src/uct/cuda/cuda_ipc/cuda_ipc_iface.h
Added pid_ns field to uct_cuda_ipc_event_desc_t struct.
Endpoint mapping
src/uct/cuda/cuda_ipc/cuda_ipc_ep.c
Updated offset and boundary check calculations to use nested key paths (key->super->super->d_bptr). Extended cuda_ipc_event to populate pid_ns from key->super->pid_ns.
Interface implementation
src/uct/cuda/cuda_ipc/cuda_ipc_iface.c
Updated uct_cuda_ipc_complete_event() to forward pid_ns to unmap_memhandle(). Changed mapped_offset calculation to use nested key path.
Map memhandle signature
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c
Changed uct_cuda_ipc_map_memhandle() UCS_PROFILE_FUNC to accept uct_cuda_ipc_extended_rkey_t *ext_key instead of uct_cuda_ipc_rkey_t *key; added pid_ns comparison in fast-path check using ext_key->super.
Test helpers
test/gtest/uct/cuda/test_cuda_ipc_md.cc
Updated test helper functions (unpack_common, unpack, unpack_masync) to return uct_cuda_ipc_extended_rkey_t instead of uct_cuda_ipc_rkey_t. Updated field accesses to use nested structure paths (rkey.super.uuid.bytes).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Wire format compatibility: Verify that extended rkey flag handling and default namespace behavior preserve backward compatibility across serialization boundaries
  • Namespace propagation completeness: Trace pid_ns threading through all call paths (map/unmap/cache lookup/insertion) to ensure no missing propagation
  • Nested structure access patterns: Validate all nested key access chains (key->super->super, ext_key->super, etc.) are consistent and correct across all affected files
  • Test coverage: Confirm test helper updates in test_cuda_ipc_md.cc align with implementation changes and cover namespace-aware scenarios

Poem

🐰 A namespace here, a pid_ns there,
Through nested keys we hop with care!
IPC cache keys now know their place,
With extended rkeys keeping the pace! ✨
Backward compatible, clean, and bright—
Process isolation done just right!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the main change: adding process namespace support to CUDA IPC remote keys. It is specific, clear, and directly reflects the primary objective.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5a54327 and 43d7f32.

📒 Files selected for processing (8)
  • src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (8 hunks)
  • src/uct/cuda/cuda_ipc/cuda_ipc_cache.h (1 hunks)
  • src/uct/cuda/cuda_ipc/cuda_ipc_ep.c (2 hunks)
  • src/uct/cuda/cuda_ipc/cuda_ipc_iface.c (2 hunks)
  • src/uct/cuda/cuda_ipc/cuda_ipc_iface.h (1 hunks)
  • src/uct/cuda/cuda_ipc/cuda_ipc_md.c (8 hunks)
  • src/uct/cuda/cuda_ipc/cuda_ipc_md.h (2 hunks)
  • test/gtest/uct/cuda/test_cuda_ipc_md.cc (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.h (1)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (1)
  • uct_cuda_ipc_unmap_memhandle (491-538)
src/uct/cuda/cuda_ipc/cuda_ipc_md.c (1)
src/ucs/sys/sys.c (2)
  • ucs_sys_ns_is_default (1423-1426)
  • ucs_sys_get_ns (1398-1421)
src/uct/cuda/cuda_ipc/cuda_ipc_cache.c (1)
src/ucs/sys/sys.c (1)
  • ucs_sys_get_ns (1398-1421)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (8)
src/uct/cuda/cuda_ipc/cuda_ipc_md.h (2)

113-118: Pid namespace captured alongside the PID.

Augmenting uct_cuda_ipc_memh_t with pid_ns keeps cached handles keyed consistently with the new namespace-aware lookup path. Nicely done.


147-155: Extended rkey layering matches downstream usage.

The nested super layout slots cleanly into the existing key->super.super.* access sites while preserving the legacy rkey footprint. Looks good.

src/uct/cuda/cuda_ipc/cuda_ipc_iface.h (1)

48-55: Event descriptors now carry namespace context.

Including pid_ns here keeps the completion/unmap path in sync with the new cache keying. All consistent.

src/uct/cuda/cuda_ipc/cuda_ipc_ep.c (1)

135-191: Namespace propagation through the async copy path looks solid.

Switching to super.super for base pointer/len and wiring pid_ns into the event mirrors the extended rkey structure without altering the transfer flow. LGTM.

src/uct/cuda/cuda_ipc/cuda_ipc_iface.c (2)

324-330: Unmap call now keyed by namespace.

Passing pid_ns here ties the completion path directly into the namespace-partitioned caches. Looks correct.


441-443: Device pointer diff updated to extended layout.

Using key->super.super.d_bptr keeps the offset calculation aligned with the new struct nesting. All good.

src/uct/cuda/cuda_ipc/cuda_ipc_cache.h (1)

62-70: Cache API widened appropriately.

Updating the signatures to accept the extended rkey and explicit pid_ns keeps the cache aligned with the rest of the stack. Looks good.

src/uct/cuda/cuda_ipc/cuda_ipc_md.c (1)

298-304: Conditional mask preserves legacy layout

Lines 298-304 only append the namespace payload when ucs_sys_ns_is_default(UCS_SYS_NS_TYPE_PID) reports a non-default PID namespace, so legacy consumers that expect the bare uct_cuda_ipc_rkey_t layout remain unaffected in the default case. Nicely contained change.


Comment @coderabbitai help to get the list of available commands and usage tips.

@shasson5
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

@shasson5
Copy link
Contributor Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 5 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants