UCT/CUDA: add loop unroll for warp copy #10977
base: master
Conversation
Force-pushed from 85b7fc6 to 0183280
```cpp
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
    dst_ref.fetch_add(inc_value, cuda::memory_order_relaxed);
    cuda::atomic_thread_fence(cuda::memory_order_release, cuda::thread_scope_system);
```
Isn't it a subtle race to put this fence after the increment?
What if a reader acquires the atomic and reads the data that was supposed to be released by the fence before the actual fence is executed?
Overall, this code seems to be a redundant implementation of the following one-liner:

```cpp
__nv_atomic_add(dst, inc_value, __NV_ATOMIC_RELEASE, __NV_THREAD_SCOPE_SYSTEM);
```
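For context, here is a minimal sketch contrasting the two orderings under discussion, written with the same libcu++ `cuda::atomic_ref` API the quoted snippet uses (whether the `__nv_atomic_add` builtin is available depends on the toolchain):

```cuda
#include <cstdint>
#include <cuda/atomic>

// Existing pattern: relaxed RMW followed by a system-scope release fence.
__device__ void inc_relaxed_then_fence(uint64_t *dst, uint64_t inc_value)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
    dst_ref.fetch_add(inc_value, cuda::memory_order_relaxed);
    cuda::atomic_thread_fence(cuda::memory_order_release,
                              cuda::thread_scope_system);
}

// Suggested alternative: the release ordering is attached to the RMW itself.
__device__ void inc_release(uint64_t *dst, uint64_t inc_value)
{
    cuda::atomic_ref<uint64_t, cuda::thread_scope_system> dst_ref{*dst};
    dst_ref.fetch_add(inc_value, cuda::memory_order_release);
}
```

With the release attached to the RMW itself, a reader that acquires the updated value is guaranteed to observe writes made before the increment — the property a standalone fence placed after the increment cannot provide.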
This PR doesn't change the previous implementation of the atomic; it just adds unrolling to the warp put operation.
Regarding the race: even if we do it in the other order, i.e. fence -> relaxed RMW, it will not be enough, since only lane 0 does it before calling level_sync. So yes, it could be a race, but it depends on what guarantees the UCX API provides.
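To illustrate the scenario being described, a minimal sketch (the function and parameter names are hypothetical, not the UCX device API):

```cuda
#include <cstdint>
#include <cuda/atomic>

// Hypothetical shape of the warp-level completion path discussed above:
// lane 0 bumps the flag *before* the warp synchronizes, so data written
// by the other lanes is not ordered by lane 0's release - a peer that
// observes the new flag value may still miss those writes.
__device__ void warp_signal_completion(uint64_t *flag, uint64_t inc_value,
                                       unsigned lane_id)
{
    if (lane_id == 0) {
        cuda::atomic_ref<uint64_t, cuda::thread_scope_system> ref{*flag};
        ref.fetch_add(inc_value, cuda::memory_order_release);
    }
    __syncwarp(); /* level_sync happens only after the increment */
}
```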
Walkthrough

Refactors CUDA IPC copy paths to a templated, vector-type aligned-copy mechanism; replaces the fixed warp-size macro with UCS_DEVICE_NUM_THREADS_IN_WARP; adds level-templated device copy/ep APIs and a device atomic increment; expands tests to include CUDA IPC variants and removes a redundant test warp-size constant.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant EP as uct_cuda_ipc_ep_put_*<br/>(level)
    participant Level as uct_cuda_ipc_copy_level<br/>(level)
    participant Aligned as uct_cuda_ipc_try_copy_aligned<vec_t>
    participant Mem as vec_load/store (cg)

    Caller->>EP: call ep_put with level template
    EP->>Level: dispatch to copy_level (level)
    alt THREAD level
        Level->>Mem: inline memcpy (thread)
    else BLOCK/WARP level
        Level->>Aligned: invoke templated aligned-copy
        Aligned->>Aligned: check alignment & lane/warp ids
        alt aligned
            Aligned->>Mem: vectorized cg load/store (vec_t)
        else unaligned/tail
            Aligned->>Mem: scalar fallback per-byte
        end
    else GRID level
        Level->>Level: placeholder (not implemented)
    end
    Level-->>EP: copy complete
    EP-->>Caller: return status
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2)
113-114: Extract repeated expression for readability.

The expression `UCS_DEVICE_NUM_THREADS_IN_WARP * UCT_CUDA_IPC_COPY_LOOP_UNROLL` is repeated multiple times throughout this function. Past reviewers also flagged this. Apply this diff to introduce a local constant:

```diff
 {
     constexpr size_t vec_size = sizeof(vec_t);
+    constexpr size_t warp_lanes_unroll = UCS_DEVICE_NUM_THREADS_IN_WARP *
+                                         UCT_CUDA_IPC_COPY_LOOP_UNROLL;

     if (!(UCT_CUDA_IPC_IS_ALIGNED_POW2((intptr_t)src, vec_size) &&
           UCT_CUDA_IPC_IS_ALIGNED_POW2((intptr_t)dst, vec_size))) {
         return;
     }

     auto src_vec = reinterpret_cast<const vec_t*>(src);
     auto dst_vec = reinterpret_cast<vec_t*>(dst);
-    constexpr size_t lanes_unroll = UCS_DEVICE_NUM_THREADS_IN_WARP *
-                                    UCT_CUDA_IPC_COPY_LOOP_UNROLL;
+    constexpr size_t lanes_unroll = warp_lanes_unroll;
     size_t num_lines = (len / (lanes_unroll * vec_size)) * lanes_unroll;

     for (size_t line = warp_id * lanes_unroll +
                        lane_id % UCS_DEVICE_NUM_THREADS_IN_WARP;
          line < num_lines; line += num_warps * lanes_unroll) {
         vec_t tmp[UCT_CUDA_IPC_COPY_LOOP_UNROLL];
 #pragma unroll
         for (int i = 0; i < UCT_CUDA_IPC_COPY_LOOP_UNROLL; i++) {
             tmp[i] = uct_cuda_ipc_ld_global_cg(
-                src_vec + (line + UCS_DEVICE_NUM_THREADS_IN_WARP * i));
+                src_vec + (line + (warp_lanes_unroll /
+                                   UCT_CUDA_IPC_COPY_LOOP_UNROLL) * i));
         }
 #pragma unroll
         for (int i = 0; i < UCT_CUDA_IPC_COPY_LOOP_UNROLL; i++) {
             uct_cuda_ipc_st_global_cg(
-                dst_vec + (line + UCS_DEVICE_NUM_THREADS_IN_WARP * i), tmp[i]);
+                dst_vec + (line + (warp_lanes_unroll /
+                                   UCT_CUDA_IPC_COPY_LOOP_UNROLL) * i), tmp[i]);
         }
     }
```

Based on learnings
Also applies to: 118-118, 120-120, 125-125, 131-131
115-116: Consider renaming `num_lines` to `num_vectors` for clarity.

The variable `num_lines` represents the count of vector elements rather than lines of code or data lines. As noted in past review feedback, `num_vectors` would more accurately reflect what is being counted and improve code readability.

Based on learnings
Also applies to: 139-140
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2 hunks)
- test/gtest/ucp/test_ucp_device.cc (3 hunks)
- test/gtest/uct/cuda/test_cuda_ipc_device.cc (0 hunks)
💤 Files with no reviewable changes (1)
- test/gtest/uct/cuda/test_cuda_ipc_device.cc
🔇 Additional comments (4)
test/gtest/ucp/test_ucp_device.cc (1)
259-259: LGTM! Test coverage expanded for CUDA IPC.

The additions of cuda_ipc test variants to the three test suites (test_ucp_device, test_ucp_device_kernel, and test_ucp_device_xfer) appropriately expand test coverage to exercise the CUDA IPC memory path alongside the existing rc_gda tests.
Also applies to: 350-351, 644-645
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (3)
62-63: LGTM! Using standard warp constant.

Replacing the hardcoded warp size with `UCS_DEVICE_NUM_THREADS_IN_WARP` improves portability and aligns with device-level abstractions used throughout the codebase.
165-185: LGTM! Well-structured copy implementation.

The tiered approach to copying (int4 → int2 → byte-by-byte) efficiently handles different alignment scenarios. The thread-level specialization using `memcpy` is appropriate for single-threaded execution.

Also applies to: 187-193
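As a rough illustration of the tiered dispatch described above (a sketch only — the helper names and block-stride indexing are made up, not the actual cuda_ipc.cuh code):

```cuda
#include <cstddef>
#include <cstdint>

// Try one vector width: bail out if pointers or length are misaligned.
template<typename vec_t>
__device__ bool try_copy_aligned(void *dst, const void *src, size_t len)
{
    if ((reinterpret_cast<uintptr_t>(dst) | reinterpret_cast<uintptr_t>(src) |
         len) % sizeof(vec_t) != 0) {
        return false; /* not aligned to this vector width */
    }
    auto d = static_cast<vec_t*>(dst);
    auto s = static_cast<const vec_t*>(src);
    for (size_t i = threadIdx.x; i < len / sizeof(vec_t); i += blockDim.x) {
        d[i] = s[i]; /* one vector per thread per iteration */
    }
    return true;
}

// Widest vector type first, then narrower, then a byte fallback.
__device__ void copy_tiered(void *dst, const void *src, size_t len)
{
    if (try_copy_aligned<int4>(dst, src, len)) return; /* 16-byte vectors */
    if (try_copy_aligned<int2>(dst, src, len)) return; /*  8-byte vectors */
    auto d = static_cast<char*>(dst);
    auto s = static_cast<const char*>(src);
    for (size_t i = threadIdx.x; i < len; i += blockDim.x) {
        d[i] = s[i]; /* byte-by-byte fallback */
    }
}
```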
195-199: Verify if grid-level copy support is needed.

The `UCS_DEVICE_LEVEL_GRID` specialization is a placeholder with no implementation. Confirm whether grid-level copy support is planned for future work or if this level is intentionally unsupported for CUDA IPC operations.
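For reference, if grid-level support is added later, one plausible shape is a grid-stride copy via cooperative groups (purely illustrative, and it would also require a cooperative launch):

```cuda
#include <cstddef>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Illustrative grid-level copy: every thread in the grid participates,
// striding by the total number of threads in the grid.
__device__ void grid_copy(char *dst, const char *src, size_t len)
{
    cg::grid_group grid = cg::this_grid();
    for (size_t i = grid.thread_rank(); i < len; i += grid.num_threads()) {
        dst[i] = src[i];
    }
}
```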
Force-pushed from aa3ed75 to 1694a40
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (1)
157-163: The fence ordering remains a pre-existing concern.

As noted in previous review comments, placing the fence after the relaxed atomic could allow readers to observe the incremented value before the fence establishes the release semantics. Consider using `__nv_atomic_add(dst, inc_value, __NV_ATOMIC_RELEASE, __NV_THREAD_SCOPE_SYSTEM)` to provide atomicity and release ordering in a single operation. However, since this PR refactors existing logic without changing its semantics, addressing this concern can be deferred.
🧹 Nitpick comments (2)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2)
182-184: Consider renaming loop variable for clarity.

The variable name `line` is misleading since it iterates over individual bytes, not lines. Consider renaming it to `i` or `byte_idx` for clarity. Apply this diff:

```diff
-    for (size_t line = lane_id; line < len; line += num_lanes) {
-        d1[line] = s1[line];
+    for (size_t i = lane_id; i < len; i += num_lanes) {
+        d1[i] = s1[i];
     }
```
195-199: Grid-level implementation is a placeholder.

The grid-level copy is not yet implemented. Ensure this is tracked if grid-level operations are planned for future use.
Do you want me to open an issue to track the grid-level implementation?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- src/uct/cuda/cuda_ipc/cuda_ipc.cuh (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: UCX PR (Static_check Static checks)
- GitHub Check: UCX PR (Codestyle ctags check)
- GitHub Check: UCX PR (Codestyle format code)
- GitHub Check: UCX PR (Codestyle AUTHORS file update check)
- GitHub Check: UCX PR (Codestyle commit title)
- GitHub Check: UCX PR (Codestyle codespell check)
- GitHub Check: UCX release DRP (Prepare CheckRelease)
- GitHub Check: UCX release (Prepare CheckRelease)
- GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (5)
src/uct/cuda/cuda_ipc/cuda_ipc.cuh (5)
62-63: LGTM! Good refactor to use standard constant.

Replacing the local `UCT_CUDA_IPC_WARP_SIZE` macro with `UCS_DEVICE_NUM_THREADS_IN_WARP` improves consistency across the codebase.
118-118: The modulo operation is necessary for multi-warp scenarios.

The `lane_id % UCS_DEVICE_NUM_THREADS_IN_WARP` at line 118 is correct. While it appears redundant for `UCS_DEVICE_LEVEL_WARP` (where `lane_id` is already the warp-lane ID), it's essential for `UCS_DEVICE_LEVEL_BLOCK`, where `lane_id` equals `threadIdx.x` and can exceed 32 when multiple warps participate in the copy.
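A small illustration of that distinction (assumed semantics; only `UCS_DEVICE_NUM_THREADS_IN_WARP` comes from the review):

```cuda
// At WARP level the caller already passes a lane id in [0, 32), so the
// modulo is a no-op; at BLOCK level lane_id == threadIdx.x and can exceed
// 31, so the modulo recovers the position inside the thread's own warp.
__device__ unsigned warp_lane(unsigned lane_id)
{
    return lane_id % 32; /* i.e. % UCS_DEVICE_NUM_THREADS_IN_WARP */
}
```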
150-155: LGTM! Clean pointer arithmetic.

The `uct_cuda_ipc_map_remote` implementation correctly computes the mapped address.
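Only the review summary is quoted here, but such a mapping helper typically amounts to rebasing a remote VA onto the locally mapped IPC region; a guess at the shape, with all names illustrative:

```cuda
#include <cstdint>

// Illustrative: translate an address in the peer's VA space into the
// local VA of the same byte inside the imported IPC mapping.
__device__ inline void *map_remote(uintptr_t remote_addr,
                                   uintptr_t remote_base,
                                   void *local_mapped_base)
{
    return static_cast<char*>(local_mapped_base) + (remote_addr - remote_base);
}
```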
187-193: LGTM! Appropriate use of memcpy for single-threaded case.

The thread-level specialization correctly uses `memcpy` for sequential copy operations.
201-318: LGTM! Consistent level-templated interface.

All endpoint functions now correctly use the level template parameter with appropriate defaults (`UCS_DEVICE_LEVEL_BLOCK`). The integration with `uct_cuda_ipc_copy_level<level>` and `uct_cuda_ipc_level_sync<level>` is consistent throughout.
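Schematically, a level-templated dispatch of this kind can be expressed with template specializations; a minimal sketch using made-up names (the real enumerators are the UCS_DEVICE_LEVEL_* values):

```cuda
#include <cstddef>
#include <cstring>

// Illustrative device-level enum and dispatch; not the actual UCX types.
enum device_level_t { LEVEL_THREAD, LEVEL_WARP, LEVEL_BLOCK, LEVEL_GRID };

template<device_level_t level>
__device__ void copy_level(void *dst, const void *src, size_t len);

// Single thread: a plain sequential copy is the right tool.
template<>
__device__ void copy_level<LEVEL_THREAD>(void *dst, const void *src,
                                         size_t len)
{
    memcpy(dst, src, len);
}

// Warp: each lane copies a strided slice of the buffer.
template<>
__device__ void copy_level<LEVEL_WARP>(void *dst, const void *src, size_t len)
{
    auto d = static_cast<char*>(dst);
    auto s = static_cast<const char*>(src);
    for (size_t i = threadIdx.x % 32; i < len; i += 32) {
        d[i] = s[i];
    }
}
```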
What?
Implement loop unrolling for the device-level warp copy path in cuda_ipc
Why?
Improves put performance
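To make the change concrete, below is a minimal sketch of an unrolled warp copy in the spirit of this PR (the unroll factor, indexing, and function name are illustrative, pieced together from the snippets quoted in the review above):

```cuda
#include <cstddef>

#define WARP_SIZE 32
#define UNROLL    4

// Each lane stages UNROLL vector loads into registers, then issues UNROLL
// stores, so the memory pipeline sees several independent in-flight
// transactions per lane instead of one.
__device__ void warp_copy_unrolled(int4 *dst, const int4 *src,
                                   size_t num_vecs, unsigned lane_id)
{
    const size_t stride = WARP_SIZE * UNROLL;
    const size_t main_part = (num_vecs / stride) * stride;

    for (size_t base = lane_id; base < main_part; base += stride) {
        int4 tmp[UNROLL];
#pragma unroll
        for (int i = 0; i < UNROLL; i++) {
            tmp[i] = src[base + WARP_SIZE * i];  /* coalesced loads */
        }
#pragma unroll
        for (int i = 0; i < UNROLL; i++) {
            dst[base + WARP_SIZE * i] = tmp[i];  /* coalesced stores */
        }
    }
    /* remainder: one vector per lane per iteration */
    for (size_t i = main_part + lane_id; i < num_vecs; i += WARP_SIZE) {
        dst[i] = src[i];
    }
}
```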
Summary by CodeRabbit
Refactor
Tests