Conversation

@Sergei-Lebedev (Contributor) commented Oct 24, 2025

What?

Implement loop unrolling for device level warp in cuda_ipc

Why?

Improves put performance

Summary by CodeRabbit

  • Refactor

    • Improved CUDA IPC copy paths with a generic, alignment-aware mechanism and unified warp/thread handling for more reliable and efficient device-side memory operations.
    • Added multi-level copy support enabling optimized behavior for different execution scopes.
  • Tests

    • Expanded CUDA IPC test coverage across multiple suites to validate the updated memory paths.

@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/cuda_ipc_warp_unroll_put branch from 85b7fc6 to 0183280 Compare October 27, 2025 09:54
brminich previously approved these changes Oct 30, 2025
{
cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
dst_ref.fetch_add(inc_value, cuda::memory_order_relaxed);
cuda::atomic_thread_fence(cuda::memory_order_release, cuda::thread_scope_system);

Isn't it a subtle race to put this fence after the increment?
What if a reader acquires the atomic and reads the data that was supposed to be released by the fence before the actual fence is executed?

Overall, this code seems to be a redundant implementation of the following one-liner:
__nv_atomic_add(dst, inc_value, __NV_ATOMIC_RELEASE, __NV_THREAD_SCOPE_SYSTEM);
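
For illustration, a minimal sketch (not part of this PR) of folding the fence into the RMW using the libcu++ types the file already uses; the helper name is hypothetical:

#include <cstdint>
#include <cuda/atomic>

__device__ void atomic_inc_release(uint64_t *dst, uint64_t inc_value)
{
    /* A single release-ordered RMW: stores made by this thread before the
     * increment become visible to any system-scope reader that acquires
     * this atomic, with no separate fence after the increment. */
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
    dst_ref.fetch_add(inc_value, cuda::memory_order_release);
}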

@Sergei-Lebedev (Contributor, Author) replied:

This PR doesn't change the previous atomic implementation; it only adds unrolling to the warp put operation.
Regarding the race: even if we do it in a different order, i.e. fence -> relaxed RMW, it will not be enough, since only lane 0 does it before calling level_sync. So yes, it could be a race, but it depends on what guarantees the UCX API provides.

coderabbitai bot commented Nov 5, 2025

Walkthrough

Refactors CUDA IPC copy paths to a templated, vector-type aligned-copy mechanism; replaces fixed warp-size macro with UCS_DEVICE_NUM_THREADS_IN_WARP; adds level-templated device copy/ep APIs and a device atomic increment; expands tests to include CUDA IPC variants and removes a redundant test warp-size constant.

Changes

Cohort / File(s) / Summary

  • CUDA IPC core refactor (src/uct/cuda/cuda_ipc/cuda_ipc.cuh): Removed UCT_CUDA_IPC_WARP_SIZE; use UCS_DEVICE_NUM_THREADS_IN_WARP. Added templated uct_cuda_ipc_try_copy_aligned<vec_t>, consolidated vectorized copy paths, updated uct_cuda_ipc_copy_level to use templates/levels, added uct_cuda_ipc_map_remote pointer-offset semantics, exposed uct_cuda_ipc_atomic_inc, and added level-templated ep_put/ep_atomic_add wrappers and device-level copy specializations.
  • Test suite expansions (test/gtest/ucp/test_ucp_device.cc): Added CUDA IPC variants to the existing test suites (test_ucp_device, test_ucp_device_kernel, test_ucp_device_xfer) so tests run with cuda_ipc in addition to rc_gda.
  • Test constant removal (test/gtest/uct/cuda/test_cuda_ipc_device.cc): Removed static const unsigned WARP_SIZE = 32 from test_cuda_ipc_rma (redundant fixed warp-size).

Sequence Diagram

sequenceDiagram
    participant Caller
    participant EP as uct_cuda_ipc_ep_put_*<br/>(level)
    participant Level as uct_cuda_ipc_copy_level<br/>(level)
    participant Aligned as uct_cuda_ipc_try_copy_aligned<vec_t>
    participant Mem as vec_load/store (cg)

    Caller->>EP: call ep_put with level template
    EP->>Level: dispatch to copy_level (level)
    alt THREAD level
        Level->>Mem: inline memcpy (thread)
    else BLOCK/WARP level
        Level->>Aligned: invoke templated aligned-copy
        Aligned->>Aligned: check alignment & lane/warp ids
        alt aligned
            Aligned->>Mem: vectorized cg load/store (vec_t)
        else unaligned/tail
            Aligned->>Mem: scalar fallback per-byte
        end
    else GRID level
        Level->>Level: placeholder (not implemented)
    end
    Level-->>EP: copy complete
    EP-->>Caller: return status

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Attention areas:
    • uct_cuda_ipc_try_copy_aligned<vec_t> alignment checks and loop correctness
    • Warp/lane calculations using UCS_DEVICE_NUM_THREADS_IN_WARP
    • Template dispatch for uct_cuda_ipc_copy_level and ep_put wrappers
    • Vectorized load/store helpers and tail handling

Poem

🐰 I hop through bytes with gentle paws,
Templates nuzzle old fixed laws.
Warps now learn to count and share,
Aligned vectors leap through air.
A tiny rabbit cheers the fares! 🥕

Pre-merge checks

❌ Failed checks (1 warning)
  • Title check: ⚠️ Warning. The PR title 'UCT/CUDA: add loop unroll for warp copy' is partially related to the changeset but does not reflect the main changes made, which involve templating, device-level refactoring, and eliminating fixed warp-size assumptions. Resolution: revise the title to better reflect the primary changes, such as 'UCT/CUDA: refactor IPC copy with device-level templates and dynamic warp handling', to accurately convey the scope and nature of the refactoring.
✅ Passed checks (1 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2)

113-114: Extract repeated expression for readability.

The expression UCS_DEVICE_NUM_THREADS_IN_WARP * UCT_CUDA_IPC_COPY_LOOP_UNROLL is repeated multiple times throughout this function. Past reviewers also flagged this.

Apply this diff to introduce a local constant:

 {
     constexpr size_t vec_size = sizeof(vec_t);
+    constexpr size_t warp_lanes_unroll = UCS_DEVICE_NUM_THREADS_IN_WARP * 
+                                          UCT_CUDA_IPC_COPY_LOOP_UNROLL;
 
     if (!(UCT_CUDA_IPC_IS_ALIGNED_POW2((intptr_t)src, vec_size) &&
           UCT_CUDA_IPC_IS_ALIGNED_POW2((intptr_t)dst, vec_size))) {
         return;
     }
 
     auto src_vec                    = reinterpret_cast<const vec_t*>(src);
     auto dst_vec                    = reinterpret_cast<vec_t*>(dst);
-    constexpr size_t lanes_unroll   = UCS_DEVICE_NUM_THREADS_IN_WARP *
-                                      UCT_CUDA_IPC_COPY_LOOP_UNROLL;
+    constexpr size_t lanes_unroll   = warp_lanes_unroll;
     size_t num_lines                = (len / (lanes_unroll * vec_size)) *
                                       lanes_unroll;
 
     for (size_t line = warp_id * lanes_unroll + lane_id % UCS_DEVICE_NUM_THREADS_IN_WARP;
          line < num_lines;
          line += num_warps * lanes_unroll) {
         vec_t tmp[UCT_CUDA_IPC_COPY_LOOP_UNROLL];
 #pragma unroll
         for (int i = 0; i < UCT_CUDA_IPC_COPY_LOOP_UNROLL; i++) {
             tmp[i] = uct_cuda_ipc_ld_global_cg(
-                src_vec + (line + UCS_DEVICE_NUM_THREADS_IN_WARP * i));
+                src_vec + (line + (warp_lanes_unroll / UCT_CUDA_IPC_COPY_LOOP_UNROLL) * i));
         }
 
 #pragma unroll
         for (int i = 0; i < UCT_CUDA_IPC_COPY_LOOP_UNROLL; i++) {
             uct_cuda_ipc_st_global_cg(
-                dst_vec + (line + UCS_DEVICE_NUM_THREADS_IN_WARP * i), tmp[i]);
+                dst_vec + (line + (warp_lanes_unroll / UCT_CUDA_IPC_COPY_LOOP_UNROLL) * i), tmp[i]);
         }
     }

Based on learnings

Also applies to: 118-118, 120-120, 125-125, 131-131


115-116: Consider renaming num_lines to num_vectors for clarity.

The variable num_lines represents the count of vector elements rather than lines of code or data lines. As noted in past review feedback, num_vectors would more accurately reflect what is being counted and improve code readability.

Based on learnings

Also applies to: 139-140

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5db80cf and aa3ed75.

📒 Files selected for processing (3)
  • src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2 hunks)
  • test/gtest/ucp/test_ucp_device.cc (3 hunks)
  • test/gtest/uct/cuda/test_cuda_ipc_device.cc (0 hunks)
💤 Files with no reviewable changes (1)
  • test/gtest/uct/cuda/test_cuda_ipc_device.cc
🔇 Additional comments (4)
test/gtest/ucp/test_ucp_device.cc (1)

259-259: LGTM! Test coverage expanded for CUDA IPC.

The additions of cuda_ipc test variants to the three test suites (test_ucp_device, test_ucp_device_kernel, and test_ucp_device_xfer) appropriately expand test coverage to exercise the CUDA IPC memory path alongside the existing rc_gda tests.

Also applies to: 350-351, 644-645

src/uct/cuda/cuda_ipc/cuda_ipc.cuh (3)

62-63: LGTM! Using standard warp constant.

Replacing the hardcoded warp size with UCS_DEVICE_NUM_THREADS_IN_WARP improves portability and aligns with device-level abstractions used throughout the codebase.


165-185: LGTM! Well-structured copy implementation.

The tiered approach to copying (int4 → int2 → byte-by-byte) efficiently handles different alignment scenarios. The thread-level specialization using memcpy is appropriate for single-threaded execution.

Also applies to: 187-193
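
The tiered order described above can be pictured with a minimal single-thread sketch (hypothetical helper names, not the UCX source); the real code additionally spreads the copy across warp lanes with unrolled, cache-global loads and stores:

#include <cstddef>
#include <cstdint>

/* Hypothetical sketch of the int4 -> int2 -> byte fallback. Each tier copies
 * only the prefix it can handle when both pointers are suitably aligned and
 * advances src/dst/len; anything left falls through to the next tier. */
template<typename vec_t>
__device__ void copy_prefix_aligned(const char *&src, char *&dst, size_t &len)
{
    const size_t vec_size = sizeof(vec_t);
    if ((reinterpret_cast<uintptr_t>(src) % vec_size) ||
        (reinterpret_cast<uintptr_t>(dst) % vec_size)) {
        return; /* unaligned: leave everything for a narrower tier */
    }
    size_t n = len / vec_size;
    for (size_t i = 0; i < n; i++) {
        reinterpret_cast<vec_t*>(dst)[i] = reinterpret_cast<const vec_t*>(src)[i];
    }
    src += n * vec_size;
    dst += n * vec_size;
    len -= n * vec_size;
}

__device__ void copy_tiered(const void *src_p, void *dst_p, size_t len)
{
    auto src = static_cast<const char*>(src_p);
    auto dst = static_cast<char*>(dst_p);

    copy_prefix_aligned<int4>(src, dst, len); /* 16-byte vectors */
    copy_prefix_aligned<int2>(src, dst, len); /* 8-byte vectors  */
    for (size_t i = 0; i < len; i++) {        /* byte-by-byte tail */
        dst[i] = src[i];
    }
}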


195-199: Verify if grid-level copy support is needed.

The UCS_DEVICE_LEVEL_GRID specialization is a placeholder with no implementation. Confirm whether grid-level copy support is planned for future work or if this level is intentionally unsupported for CUDA IPC operations.

@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/cuda_ipc_warp_unroll_put branch from aa3ed75 to 1694a40 Compare November 5, 2025 09:15
@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (1)

157-163: The fence ordering remains a pre-existing concern.

As noted in previous review comments, placing the fence after the relaxed atomic could allow readers to observe the incremented value before the fence establishes the release semantics. Consider using __nv_atomic_add(dst, inc_value, __NV_ATOMIC_RELEASE, __NV_THREAD_SCOPE_SYSTEM) to provide atomicity and release ordering in a single operation. However, since this PR refactors existing logic without changing its semantics, addressing this concern can be deferred.

🧹 Nitpick comments (2)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2)

182-184: Consider renaming loop variable for clarity.

The variable name line is misleading since it iterates over individual bytes, not lines. Consider renaming it to i or byte_idx for clarity.

Apply this diff:

-    for (size_t line = lane_id; line < len; line += num_lanes) {
-        d1[line] = s1[line];
+    for (size_t i = lane_id; i < len; i += num_lanes) {
+        d1[i] = s1[i];
     }

195-199: Grid-level implementation is a placeholder.

The grid-level copy is not yet implemented. Ensure this is tracked if grid-level operations are planned for future use.

Do you want me to open an issue to track the grid-level implementation?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aa3ed75 and 1694a40.

📒 Files selected for processing (1)
  • src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (5)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (5)

62-63: LGTM! Good refactor to use standard constant.

Replacing the local UCT_CUDA_IPC_WARP_SIZE macro with UCS_DEVICE_NUM_THREADS_IN_WARP improves consistency across the codebase.


118-118: The modulo operation is necessary for multi-warp scenarios.

The lane_id % UCS_DEVICE_NUM_THREADS_IN_WARP at line 118 is correct. While it appears redundant for UCS_DEVICE_LEVEL_WARP (where lane_id is already the warp-lane ID), it's essential for UCS_DEVICE_LEVEL_BLOCK, where lane_id equals threadIdx.x and can exceed 32 when multiple warps participate in the copy.
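
As a hypothetical illustration of the distinction (not the UCX source), a block-level thread index can be split into a warp index and an in-warp lane like this:

/* At BLOCK level lane_id == threadIdx.x and may exceed the warp size, so the
 * in-warp lane and warp index must be derived; at WARP level lane_id is
 * already 0..warp_size-1 and the modulo is a no-op. */
__device__ inline void split_block_tid(unsigned tid, unsigned warp_size,
                                       unsigned *lane, unsigned *warp)
{
    *lane = tid % warp_size; /* warp_size stands in for UCS_DEVICE_NUM_THREADS_IN_WARP */
    *warp = tid / warp_size;
}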


150-155: LGTM! Clean pointer arithmetic.

The uct_cuda_ipc_map_remote implementation correctly computes the mapped address.
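
A minimal sketch of the pointer-offset idea (assumed parameter names, not the actual uct_cuda_ipc_map_remote signature): the remote address is translated by preserving its offset from the remote region base inside the locally mapped IPC region.

#include <cstdint>

__device__ inline void *map_remote_sketch(uintptr_t remote_addr,
                                          uintptr_t remote_base,
                                          void *local_mapped_base)
{
    /* Offset of the target inside the peer's exported region, applied to
     * the local mapping of that same region. */
    return static_cast<char*>(local_mapped_base) + (remote_addr - remote_base);
}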


187-193: LGTM! Appropriate use of memcpy for single-threaded case.

The thread-level specialization correctly uses memcpy for sequential copy operations.


201-318: LGTM! Consistent level-templated interface.

All endpoint functions now correctly use the level template parameter with appropriate defaults (UCS_DEVICE_LEVEL_BLOCK). The integration with uct_cuda_ipc_copy_level<level> and uct_cuda_ipc_level_sync<level> is consistent throughout.
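
A minimal sketch of the level-templated dispatch pattern (hypothetical names and a simplified byte-wise cooperative path; the real code uses the vectorized, unrolled copy and uct_cuda_ipc_level_sync):

#include <cstddef>
#include <cstring>

enum device_level { LEVEL_THREAD, LEVEL_WARP, LEVEL_BLOCK, LEVEL_GRID };

template<device_level level>
__device__ void copy_level_sketch(void *dst, const void *src, size_t len)
{
    if constexpr (level == LEVEL_THREAD) {
        memcpy(dst, src, len);                 /* single thread: plain memcpy */
    } else if constexpr ((level == LEVEL_WARP) || (level == LEVEL_BLOCK)) {
        /* Cooperative path: each participating thread copies a strided subset
         * of bytes (32 stands in for UCS_DEVICE_NUM_THREADS_IN_WARP). */
        unsigned lanes = (level == LEVEL_WARP) ? 32 : blockDim.x;
        unsigned lane  = (level == LEVEL_WARP) ? (threadIdx.x % 32) : threadIdx.x;
        auto d = static_cast<char*>(dst);
        auto s = static_cast<const char*>(src);
        for (size_t i = lane; i < len; i += lanes) {
            d[i] = s[i];
        }
    } else {
        /* LEVEL_GRID: left unimplemented, mirroring the placeholder above */
    }
}

Call sites would then select the scope at compile time, e.g. copy_level_sketch<LEVEL_BLOCK>(dst, src, len).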
