
Conversation

@roiedanino (Contributor) commented Nov 19, 2025

What?

Fix the latency estimation for the zcopy fast-completion protocol: fast completion does not wait for the ACK, only for the CQE.

Why?

Currently, zcopy latency is overestimated, causing bcopy to be selected in the 2K-8K range even though zcopy performs better there (a toy model below illustrates the effect).
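
For intuition, here is a minimal, self-contained toy model of the crossover arithmetic; every constant is an illustrative assumption, not a measured UCX value:

```c
#include <stdio.h>
#include <stddef.h>

/* Toy model of bcopy vs. zcopy estimated send time in the 2K-8K range.
 * Under the old model, zcopy is charged a full ACK latency and loses;
 * once only the CQE wait is charged, zcopy wins. All numbers assumed. */
int main(void)
{
    double bcopy_bw     = 10e9;   /* memcpy bandwidth, bytes/s (assumed)    */
    double net_bw       = 25e9;   /* wire bandwidth, bytes/s (assumed)      */
    double cqe_overhead = 0.2e-6; /* send CQE processing, seconds (assumed) */
    double ack_latency  = 1.5e-6; /* one-way ACK latency, seconds (assumed) */
    size_t size;

    for (size = 2048; size <= 8192; size *= 2) {
        double bcopy     = size / bcopy_bw + size / net_bw;
        double zcopy_old = size / net_bw + cqe_overhead + ack_latency;
        double zcopy_new = size / net_bw + cqe_overhead;
        printf("%5zu B: bcopy %.2f us, zcopy(old) %.2f us, zcopy(new) %.2f us\n",
               size, bcopy * 1e6, zcopy_old * 1e6, zcopy_new * 1e6);
    }
    return 0;
}
```

With these assumed numbers the old model keeps bcopy ahead through 8K, while the corrected model selects zcopy across the whole range.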

@roiedanino self-assigned this Nov 19, 2025
@roiedanino added the Optimization (code/performance optimization) label Nov 19, 2025
@coderabbitai bot commented Nov 19, 2025

Walkthrough

A Grace CPU-specific performance overhead adjustment is introduced in the InfiniBand interface layer. When a Grace CPU is detected, the bcopy send overhead used in performance estimation is incremented by 2 microseconds to account for platform-specific characteristics.

Changes

Cohort: Grace CPU Overhead Tuning
File(s): src/uct/ib/base/ib_iface.c
Summary: Introduces a local bcopy_send_overhead variable that mirrors send_overhead->bcopy and adds a 2-microsecond adjustment for NVIDIA Grace CPUs during performance-threshold calculations, tuning the estimate for the Grace architecture.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify the CPU detection logic correctly identifies Grace processors
  • Confirm the 2-microsecond overhead value is empirically justified
  • Ensure bcopy_send_overhead is used consistently in all relevant overhead calculations

Poem

🐰 A rabbit hops through circuits bright,
Grace CPUs tuned just right,
Two microseconds, a gentle tweak,
Performance peaks we seek!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and specifically describes the main change (a Grace CPU-specific performance tuning that increases bcopy overhead) and directly relates to the file modifications and PR objectives.

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/uct/ib/base/ib_iface.c (1)

2050-2053: CPU detection API verified; optional refactor suggestion remains valid.

The CPU detection functions (ucs_arch_get_cpu_vendor(), ucs_arch_get_cpu_model()) and constants (UCS_CPU_VENDOR_NVIDIA, UCS_CPU_MODEL_NVIDIA_GRACE) are properly defined across all supported architectures and available in the codebase. The Grace CPU detection pattern at line 2051 follows the same approach already used elsewhere (e.g., src/ucs/arch/aarch64/cpu.h:147-149).

For improved maintainability, consider extracting the magic number 2 * 1e-6 into a named constant:

```diff
+#define UCT_IB_GRACE_CPU_BCOPY_OVERHEAD_ADJUSTMENT 2e-6
+
 ucs_status_t
 uct_ib_iface_estimate_perf(uct_iface_h iface, uct_perf_attr_t *perf_attr)
 {
     ...
     /* NEW: Adjust overhead for Grace CPU to influence threshold calculations */
     if ((ucs_arch_get_cpu_vendor() == UCS_CPU_VENDOR_NVIDIA) &&
         (ucs_arch_get_cpu_model() == UCS_CPU_MODEL_NVIDIA_GRACE)) {
-        bcopy_send_overhead += 2 * 1e-6;
+        bcopy_send_overhead += UCT_IB_GRACE_CPU_BCOPY_OVERHEAD_ADJUSTMENT;
     }
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03245a7 and eca1d40.

📒 Files selected for processing (1)
  • src/uct/ib/base/ib_iface.c (2 hunks)
🔇 Additional comments (2)
src/uct/ib/base/ib_iface.c (2)

2040-2040: LGTM: Clean local variable initialization.

Creating a local copy of the bcopy overhead allows platform-specific adjustments without modifying the original configuration structure.


2055-2060: LGTM: Correct usage of adjusted overhead for bcopy operations.

The adjusted bcopy_send_overhead is correctly applied only to bcopy operations, which will influence the bcopy vs zcopy threshold calculations as intended. The scope is properly limited by the uct_ep_op_is_bcopy(op) check.
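
A minimal sketch of that gating; only uct_ep_op_is_bcopy and bcopy_send_overhead come from the patch, and the remaining names (notably the zcopy field) are assumptions for illustration:

```c
/* Sketch only: bcopy operations use the adjusted local copy, while other
 * operations keep their original overhead. 'send_overhead->zcopy' is an
 * assumed field name, mirroring the 'send_overhead->bcopy' in the patch. */
overhead = uct_ep_op_is_bcopy(op) ? bcopy_send_overhead :
                                    send_overhead->zcopy;
```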


```c
/* NEW: Adjust overhead for Grace CPU to influence threshold calculations */
if ((ucs_arch_get_cpu_vendor() == UCS_CPU_VENDOR_NVIDIA) &&
    (ucs_arch_get_cpu_model() == UCS_CPU_MODEL_NVIDIA_GRACE)) {
    bcopy_send_overhead += 2 * 1e-6;
}
```
@rakhmets (Contributor) commented Nov 20, 2025
We usually try to make the values reflect the actual characteristics, and this change indicates an increase in the estimated overhead of allocating a TX buffer on Grace CPUs. Is that really the reason? Maybe it is rather the lower overheads associated with zcopy (the CQE, i.e. the estimated overhead of processing a work request completion)?

P.S. It is preferable to make such changes in ucx.conf (as you mentioned in the PR description).
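
For illustration only, the configuration-based alternative could look like the following ucx.conf entry; the variable name below is hypothetical, not an existing UCX parameter:

```
# Hypothetical ucx.conf entry: carry the platform tuning in configuration
# instead of hardcoding it in ib_iface.c. The variable name is invented
# for illustration; only the idea (a 2us bcopy overhead bump) is real.
UCX_IB_BCOPY_SEND_OVERHEAD=2us
```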

Another contributor commented:

+1 for both comments

@roiedanino changed the title from "UCT/IB/BASE: Perf tuning - increase bcopy overhead for grace cpus" to "UCP/PROTO: Perf tuning - zcopy fast-completion latency is over-estimated" on Nov 24, 2025
Comment on lines 172 to 180:

```c
    if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY) {
        if (op_attr_mask & UCP_OP_ATTR_FLAG_FAST_CMPL) {
            /* Waits only for CQE completion, not for the ACK */
            perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c +=
                    tl_perf->send_post_overhead;
        } else {
            /* Send time represents request completion, which in the case
             * of zcopy waits for an ACK from the remote side. */
            perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->latency;
        }
```
@roiedanino (Contributor, Author) commented:

The ACK is sent only in RNDV (which doubles the latency in measurements); we might need to add a UCP_PROTO_COMMON_INIT_FLAG_RNDV flag.
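
A hedged sketch of that proposal, reusing identifiers from the quoted hunk; UCP_PROTO_COMMON_INIT_FLAG_RNDV is the proposed flag and does not exist yet:

```c
/* Sketch: charge the ACK round-trip only for rendezvous, where a
 * protocol-level ACK is actually awaited; eager zcopy completes once the
 * send CQE arrives. UCP_PROTO_COMMON_INIT_FLAG_RNDV is hypothetical. */
if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY) {
    if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_RNDV) {
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->latency;
    } else {
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c +=
                tl_perf->send_post_overhead;
    }
}
```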

@yosefe (Contributor) commented Nov 25, 2025

The original intent of the FAST_COMPLETION protocols is to indicate that the operation is blocking, so the time needed to complete the request on the sender side should also be taken into account; for example, with zcopy there is the additional ACK latency.
The question here is in which flow zcopy is not selected in the 2K-8K range (which application); maybe the FAST_COMPLETION flag should not have been passed there?

@roiedanino (Contributor, Author) commented Nov 25, 2025

@yosefe

> The question here is in which flow zcopy is not selected in the 2K-8K range (which application); maybe the FAST_COMPLETION flag should not have been passed there?

OSU benchmarks, MPI_Send().

Looking at the logs, it seems we are not waiting for any ACKs in fast-completion eager zcopy, so either this is a bug or we only wait for ACKs in RNDV?

@yosefe (Contributor) commented Nov 27, 2025

> Looking at the logs, it seems we are not waiting for any ACKs in fast-completion eager zcopy,

Yes we are, because the send CQE waits for the HW ACK.
BTW, what happens if we reduce the memcpy bandwidth by 2x?

@roiedanino (Contributor, Author) commented:

> > Looking at the logs, it seems we are not waiting for any ACKs in fast-completion eager zcopy,
>
> Yes we are, because the send CQE waits for the HW ACK.
> BTW, what happens if we reduce the memcpy bandwidth by 2x?

I tried that, but it only switches to zcopy when setting BCOPY_BW=2000, which is 30 times slower than reality.
