UCT/CUDA: add loop unroll for warp copy #10977
base: master
Conversation
Force-pushed from 85b7fc6 to 0183280
```cpp
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
    dst_ref.fetch_add(inc_value, cuda::memory_order_relaxed);
    cuda::atomic_thread_fence(cuda::memory_order_release, cuda::thread_scope_system);
```
Isn't it a subtle race to put this fence after the increment?
What if a reader acquires the atomic and reads the data that was supposed to be released by the fence before the actual fence is executed?
Overall, this code seems to be a redundant implementation of the following one-liner:

```cpp
__nv_atomic_add(dst, inc_value, __NV_ATOMIC_RELEASE, __NV_THREAD_SCOPE_SYSTEM);
```
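For context, here is a minimal sketch contrasting the two orderings under discussion, written with the same libcu++ `cuda::atomic_ref` API the quoted snippet uses (whether the `__nv_atomic_add` builtin is available depends on the toolchain):

```cuda
#include <cstdint>
#include <cuda/atomic>

// Existing pattern: relaxed RMW followed by a system-scope release fence.
__device__ void inc_relaxed_then_fence(uint64_t *dst, uint64_t inc_value)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
    dst_ref.fetch_add(inc_value, cuda::memory_order_relaxed);
    cuda::atomic_thread_fence(cuda::memory_order_release,
                              cuda::thread_scope_system);
}

// Suggested alternative: the release ordering is attached to the RMW itself.
__device__ void inc_release(uint64_t *dst, uint64_t inc_value)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
    dst_ref.fetch_add(inc_value, cuda::memory_order_release);
}
```

With the release attached to the RMW itself, a reader that acquires the updated value is guaranteed to observe writes made before the increment — the property a standalone fence placed after the increment cannot provide.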
This PR doesn't change the previous implementation of the atomic; it just adds unrolling to the warp put operation.
Regarding the race: even if we do it in the other order, i.e. fence -> relaxed RMW, it will not be enough, since only lane 0 does it before calling level_sync. So yes, it could be a race, but it depends on what guarantees the UCX API provides.
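To illustrate the scenario being described, a minimal sketch (the function and parameter names are hypothetical, not the UCX device API):

```cuda
#include <cstdint>
#include <cuda/atomic>

// Hypothetical shape of the warp-level completion path discussed above:
// lane 0 bumps the flag *before* the warp synchronizes, so data written
// by the other lanes is not ordered by lane 0's release - a peer that
// observes the new flag value may still miss those writes.
__device__ void warp_signal_completion(uint64_t *flag, uint64_t inc_value,
                                       unsigned lane_id)
{
    if (lane_id == 0) {
        cuda::atomic_ref<uint64_t, cuda::thread_scope_system> ref{*flag};
        ref.fetch_add(inc_value, cuda::memory_order_release);
    }
    __syncwarp(); /* level_sync happens only after the increment */
}
```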
Walkthrough

Refactors CUDA IPC copy paths to a templated, vector-type aligned-copy mechanism; replaces the fixed warp-size macro with UCS_DEVICE_NUM_THREADS_IN_WARP; adds level-templated device copy/ep APIs and a device atomic increment; expands tests to include CUDA IPC variants and removes a redundant test warp-size constant.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant EP as uct_cuda_ipc_ep_put_*<br/>(level)
    participant Level as uct_cuda_ipc_copy_level<br/>(level)
    participant Aligned as uct_cuda_ipc_try_copy_aligned<vec_t>
    participant Mem as vec_load/store (cg)

    Caller->>EP: call ep_put with level template
    EP->>Level: dispatch to copy_level (level)
    alt THREAD level
        Level->>Mem: inline memcpy (thread)
    else BLOCK/WARP level
        Level->>Aligned: invoke templated aligned-copy
        Aligned->>Aligned: check alignment & lane/warp ids
        alt aligned
            Aligned->>Mem: vectorized cg load/store (vec_t)
        else unaligned/tail
            Aligned->>Mem: scalar fallback per-byte
        end
    else GRID level
        Level->>Level: placeholder (not implemented)
    end
    Level-->>EP: copy complete
    EP-->>Caller: return status
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2)
113-114: Extract repeated expression for readability.

The expression `UCS_DEVICE_NUM_THREADS_IN_WARP * UCT_CUDA_IPC_COPY_LOOP_UNROLL` is repeated multiple times throughout this function. Past reviewers also flagged this. Apply this diff to introduce a local constant:

```diff
 {
     constexpr size_t vec_size = sizeof(vec_t);
+    constexpr size_t warp_lanes_unroll = UCS_DEVICE_NUM_THREADS_IN_WARP *
+                                         UCT_CUDA_IPC_COPY_LOOP_UNROLL;

     if (!(UCT_CUDA_IPC_IS_ALIGNED_POW2((intptr_t)src, vec_size) &&
           UCT_CUDA_IPC_IS_ALIGNED_POW2((intptr_t)dst, vec_size))) {
         return;
     }

     auto src_vec = reinterpret_cast<const vec_t*>(src);
     auto dst_vec = reinterpret_cast<vec_t*>(dst);
-    constexpr size_t lanes_unroll = UCS_DEVICE_NUM_THREADS_IN_WARP *
-                                    UCT_CUDA_IPC_COPY_LOOP_UNROLL;
+    constexpr size_t lanes_unroll = warp_lanes_unroll;
     size_t num_lines = (len / (lanes_unroll * vec_size)) * lanes_unroll;

     for (size_t line = warp_id * lanes_unroll +
                        lane_id % UCS_DEVICE_NUM_THREADS_IN_WARP;
          line < num_lines; line += num_warps * lanes_unroll) {
         vec_t tmp[UCT_CUDA_IPC_COPY_LOOP_UNROLL];
 #pragma unroll
         for (int i = 0; i < UCT_CUDA_IPC_COPY_LOOP_UNROLL; i++) {
             tmp[i] = uct_cuda_ipc_ld_global_cg(
-                src_vec + (line + UCS_DEVICE_NUM_THREADS_IN_WARP * i));
+                src_vec + (line + (warp_lanes_unroll /
+                                   UCT_CUDA_IPC_COPY_LOOP_UNROLL) * i));
         }
 #pragma unroll
         for (int i = 0; i < UCT_CUDA_IPC_COPY_LOOP_UNROLL; i++) {
             uct_cuda_ipc_st_global_cg(
-                dst_vec + (line + UCS_DEVICE_NUM_THREADS_IN_WARP * i), tmp[i]);
+                dst_vec + (line + (warp_lanes_unroll /
+                                   UCT_CUDA_IPC_COPY_LOOP_UNROLL) * i), tmp[i]);
         }
     }
```

Based on learnings
Also applies to: 118-118, 120-120, 125-125, 131-131
115-116: Consider renaming `num_lines` to `num_vectors` for clarity.

The variable `num_lines` represents the count of vector elements rather than lines of code or data lines. As noted in past review feedback, `num_vectors` would more accurately reflect what is being counted and improve code readability.

Based on learnings
Also applies to: 139-140
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2 hunks)
- test/gtest/ucp/test_ucp_device.cc (3 hunks)
- test/gtest/uct/cuda/test_cuda_ipc_device.cc (0 hunks)
💤 Files with no reviewable changes (1)
- test/gtest/uct/cuda/test_cuda_ipc_device.cc
🔇 Additional comments (4)
test/gtest/ucp/test_ucp_device.cc (1)
259-259: LGTM! Test coverage expanded for CUDA IPC.

The additions of cuda_ipc test variants to the three test suites (test_ucp_device, test_ucp_device_kernel, and test_ucp_device_xfer) appropriately expand test coverage to exercise the CUDA IPC memory path alongside the existing rc_gda tests.
Also applies to: 350-351, 644-645
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (3)
62-63: LGTM! Using standard warp constant.

Replacing the hardcoded warp size with `UCS_DEVICE_NUM_THREADS_IN_WARP` improves portability and aligns with device-level abstractions used throughout the codebase.
165-185: LGTM! Well-structured copy implementation.

The tiered approach to copying (int4 → int2 → byte-by-byte) efficiently handles different alignment scenarios. The thread-level specialization using `memcpy` is appropriate for single-threaded execution.

Also applies to: 187-193
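As a rough illustration of the tiered dispatch described above (a sketch only — the helper names and block-stride indexing are made up, not the actual cuda_ipc.cuh code):

```cuda
#include <cstddef>
#include <cstdint>

// Try one vector width: bail out if pointers or length are misaligned.
template<typename vec_t>
__device__ bool try_copy_aligned(void *dst, const void *src, size_t len)
{
    if ((reinterpret_cast<uintptr_t>(dst) | reinterpret_cast<uintptr_t>(src) |
         len) % sizeof(vec_t) != 0) {
        return false; /* not aligned to this vector width */
    }
    auto d = static_cast<vec_t*>(dst);
    auto s = static_cast<const vec_t*>(src);
    for (size_t i = threadIdx.x; i < len / sizeof(vec_t); i += blockDim.x) {
        d[i] = s[i]; /* one vector per thread per iteration */
    }
    return true;
}

// Widest vector type first, then narrower, then a byte fallback.
__device__ void copy_tiered(void *dst, const void *src, size_t len)
{
    if (try_copy_aligned<int4>(dst, src, len)) return; /* 16-byte vectors */
    if (try_copy_aligned<int2>(dst, src, len)) return; /*  8-byte vectors */
    auto d = static_cast<char*>(dst);
    auto s = static_cast<const char*>(src);
    for (size_t i = threadIdx.x; i < len; i += blockDim.x) {
        d[i] = s[i]; /* byte-by-byte fallback */
    }
}
```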
195-199: Verify if grid-level copy support is needed.

The `UCS_DEVICE_LEVEL_GRID` specialization is a placeholder with no implementation. Confirm whether grid-level copy support is planned for future work or if this level is intentionally unsupported for CUDA IPC operations.
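For reference, if grid-level support is added later, one plausible shape is a grid-stride copy via cooperative groups (purely illustrative, and it would also require a cooperative launch):

```cuda
#include <cstddef>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Illustrative grid-level copy: every thread in the grid participates,
// striding by the total number of threads in the grid.
__device__ void grid_copy(char *dst, const char *src, size_t len)
{
    cg::grid_group grid = cg::this_grid();
    for (size_t i = grid.thread_rank(); i < len; i += grid.num_threads()) {
        dst[i] = src[i];
    }
}
```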
Force-pushed from aa3ed75 to 1694a40
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (1)
157-163: The fence ordering remains a pre-existing concern.

As noted in previous review comments, placing the fence after the relaxed atomic could allow readers to observe the incremented value before the fence establishes the release semantics. Consider using `__nv_atomic_add(dst, inc_value, __NV_ATOMIC_RELEASE, __NV_THREAD_SCOPE_SYSTEM)` to provide atomicity and release ordering in a single operation. However, since this PR refactors existing logic without changing its semantics, addressing this concern can be deferred.
🧹 Nitpick comments (2)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2)
182-184: Consider renaming loop variable for clarity.

The variable name `line` is misleading since it iterates over individual bytes, not lines. Consider renaming it to `i` or `byte_idx` for clarity. Apply this diff:

```diff
-    for (size_t line = lane_id; line < len; line += num_lanes) {
-        d1[line] = s1[line];
+    for (size_t i = lane_id; i < len; i += num_lanes) {
+        d1[i] = s1[i];
     }
```
195-199: Grid-level implementation is a placeholder.

The grid-level copy is not yet implemented. Ensure this is tracked if grid-level operations are planned for future use.
Do you want me to open an issue to track the grid-level implementation?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: UCX PR (Static_check Static checks)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (5)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (5)
62-63: LGTM! Good refactor to use standard constant.

Replacing the local `UCT_CUDA_IPC_WARP_SIZE` macro with `UCS_DEVICE_NUM_THREADS_IN_WARP` improves consistency across the codebase.
118-118: The modulo operation is necessary for multi-warp scenarios.

The `lane_id % UCS_DEVICE_NUM_THREADS_IN_WARP` at line 118 is correct. While it appears redundant for `UCS_DEVICE_LEVEL_WARP` (where `lane_id` is already the warp-lane ID), it's essential for `UCS_DEVICE_LEVEL_BLOCK`, where `lane_id` equals `threadIdx.x` and can exceed 32 when multiple warps participate in the copy.
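A small illustration of that distinction (assumed semantics; only `UCS_DEVICE_NUM_THREADS_IN_WARP` comes from the review):

```cuda
// At WARP level the caller already passes a lane id in [0, 32), so the
// modulo is a no-op; at BLOCK level lane_id == threadIdx.x and can exceed
// 31, so the modulo recovers the position inside the thread's own warp.
__device__ unsigned warp_lane(unsigned lane_id)
{
    return lane_id % 32; /* i.e. % UCS_DEVICE_NUM_THREADS_IN_WARP */
}
```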
150-155: LGTM! Clean pointer arithmetic.

The `uct_cuda_ipc_map_remote` implementation correctly computes the mapped address.
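Only the review summary is quoted here, but such a mapping helper typically amounts to rebasing a remote VA onto the locally mapped IPC region; a guess at the shape, with all names illustrative:

```cuda
#include <cstdint>

// Illustrative: translate an address in the peer's VA space into the
// local VA of the same byte inside the imported IPC mapping.
__device__ inline void *map_remote(uintptr_t remote_addr,
                                   uintptr_t remote_base,
                                   void *local_mapped_base)
{
    return static_cast<char*>(local_mapped_base) + (remote_addr - remote_base);
}
```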
187-193: LGTM! Appropriate use of memcpy for single-threaded case.

The thread-level specialization correctly uses `memcpy` for sequential copy operations.
201-318: LGTM! Consistent level-templated interface.

All endpoint functions now correctly use the level template parameter with appropriate defaults (`UCS_DEVICE_LEVEL_BLOCK`). The integration with `uct_cuda_ipc_copy_level<level>` and `uct_cuda_ipc_level_sync<level>` is consistent throughout.
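Schematically, a level-templated dispatch of this kind can be expressed with template specializations; a minimal sketch using made-up names (the real enumerators are the UCS_DEVICE_LEVEL_* values):

```cuda
#include <cstddef>
#include <cstring>

// Illustrative device-level enum and dispatch; not the actual UCX types.
enum device_level_t { LEVEL_THREAD, LEVEL_WARP, LEVEL_BLOCK, LEVEL_GRID };

template<device_level_t level>
__device__ void copy_level(void *dst, const void *src, size_t len);

// Single thread: a plain sequential copy is the right tool.
template<>
__device__ void copy_level<LEVEL_THREAD>(void *dst, const void *src,
                                         size_t len)
{
    memcpy(dst, src, len);
}

// Warp: each lane copies a strided slice of the buffer.
template<>
__device__ void copy_level<LEVEL_WARP>(void *dst, const void *src, size_t len)
{
    auto d = static_cast<char*>(dst);
    auto s = static_cast<const char*>(src);
    for (size_t i = threadIdx.x % 32; i < len; i += 32) {
        d[i] = s[i];
    }
}
```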
What?
Implement loop unrolling for the device-level warp copy path in cuda_ipc
Why?
Improves put performance
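To make the change concrete, below is a minimal sketch of an unrolled warp copy in the spirit of this PR (the unroll factor, indexing, and function name are illustrative, pieced together from the snippets quoted in the review above):

```cuda
#include <cstddef>

#define WARP_SIZE 32
#define UNROLL    4

// Each lane stages UNROLL vector loads into registers, then issues UNROLL
// stores, so the memory pipeline sees several independent in-flight
// transactions per lane instead of one.
__device__ void warp_copy_unrolled(int4 *dst, const int4 *src,
                                   size_t num_vecs, unsigned lane_id)
{
    const size_t stride = WARP_SIZE * UNROLL;
    const size_t main_part = (num_vecs / stride) * stride;

    for (size_t base = lane_id; base < main_part; base += stride) {
        int4 tmp[UNROLL];
#pragma unroll
        for (int i = 0; i < UNROLL; i++) {
            tmp[i] = src[base + WARP_SIZE * i];  /* coalesced loads */
        }
#pragma unroll
        for (int i = 0; i < UNROLL; i++) {
            dst[base + WARP_SIZE * i] = tmp[i];  /* coalesced stores */
        }
    }
    /* remainder: one vector per lane per iteration */
    for (size_t i = main_part + lane_id; i < num_vecs; i += WARP_SIZE) {
        dst[i] = src[i];
    }
}
```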
Summary by CodeRabbit
Refactor
Tests