
Conversation

@roiedanino (Contributor) commented Nov 19, 2025

What?

Fix the latency estimation for the zcopy fast-completion protocol: fast completion does not wait for the ACK, only for the CQE.

Why?

Currently, zcopy latency is overestimated, causing bcopy to be selected in the 2K-8K range even though zcopy performs better there (a toy model below illustrates the effect).
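
For intuition, here is a minimal, self-contained toy model of the crossover arithmetic; every constant is an illustrative assumption, not a measured UCX value:

```c
#include <stdio.h>
#include <stddef.h>

/* Toy model of bcopy vs. zcopy estimated send time in the 2K-8K range.
 * Under the old model, zcopy is charged a full ACK latency and loses;
 * once only the CQE wait is charged, zcopy wins. All numbers assumed. */
int main(void)
{
    double bcopy_bw     = 10e9;   /* memcpy bandwidth, bytes/s (assumed)    */
    double net_bw       = 25e9;   /* wire bandwidth, bytes/s (assumed)      */
    double cqe_overhead = 0.2e-6; /* send CQE processing, seconds (assumed) */
    double ack_latency  = 1.5e-6; /* one-way ACK latency, seconds (assumed) */
    size_t size;

    for (size = 2048; size <= 8192; size *= 2) {
        double bcopy     = size / bcopy_bw + size / net_bw;
        double zcopy_old = size / net_bw + cqe_overhead + ack_latency;
        double zcopy_new = size / net_bw + cqe_overhead;
        printf("%5zu B: bcopy %.2f us, zcopy(old) %.2f us, zcopy(new) %.2f us\n",
               size, bcopy * 1e6, zcopy_old * 1e6, zcopy_new * 1e6);
    }
    return 0;
}
```

With these assumed numbers the old model keeps bcopy ahead through 8K, while the corrected model selects zcopy across the whole range.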

@roiedanino self-assigned this Nov 19, 2025
@roiedanino added the Optimization (code/performance optimization) label Nov 19, 2025
@coderabbitai bot commented Nov 19, 2025

Walkthrough

A Grace CPU-specific performance overhead adjustment is introduced in the InfiniBand interface layer. When a Grace CPU is detected, the bcopy send overhead used in performance estimation is incremented by 2 microseconds to account for platform-specific characteristics.

Changes

Cohort: Grace CPU Overhead Tuning
File(s): src/uct/ib/base/ib_iface.c
Summary: Introduces a local bcopy_send_overhead variable that mirrors send_overhead->bcopy and adds a 2-microsecond adjustment for NVIDIA Grace CPUs during performance-threshold calculations, tuning the estimate for the Grace architecture.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Verify the CPU detection logic correctly identifies Grace processors
  • Confirm the 2-microsecond overhead value is empirically justified
  • Ensure bcopy_send_overhead is used consistently in all relevant overhead calculations

Poem

🐰 A rabbit hops through circuits bright,
Grace CPUs tuned just right,
Two microseconds, a gentle tweak,
Performance peaks we seek!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title clearly and specifically describes the main change (a Grace CPU-specific performance tuning that increases bcopy overhead) and directly relates to the file modifications and PR objectives.

@coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/uct/ib/base/ib_iface.c (1)

2050-2053: CPU detection API verified; optional refactor suggestion remains valid.

The CPU detection functions (ucs_arch_get_cpu_vendor(), ucs_arch_get_cpu_model()) and constants (UCS_CPU_VENDOR_NVIDIA, UCS_CPU_MODEL_NVIDIA_GRACE) are properly defined across all supported architectures and available in the codebase. The Grace CPU detection pattern at line 2051 follows the same approach already used elsewhere (e.g., src/ucs/arch/aarch64/cpu.h:147-149).

For improved maintainability, consider extracting the magic number 2 * 1e-6 into a named constant:

```diff
+#define UCT_IB_GRACE_CPU_BCOPY_OVERHEAD_ADJUSTMENT 2e-6
+
 ucs_status_t
 uct_ib_iface_estimate_perf(uct_iface_h iface, uct_perf_attr_t *perf_attr)
 {
     ...
     /* NEW: Adjust overhead for Grace CPU to influence threshold calculations */
     if ((ucs_arch_get_cpu_vendor() == UCS_CPU_VENDOR_NVIDIA) &&
         (ucs_arch_get_cpu_model() == UCS_CPU_MODEL_NVIDIA_GRACE)) {
-        bcopy_send_overhead += 2 * 1e-6;
+        bcopy_send_overhead += UCT_IB_GRACE_CPU_BCOPY_OVERHEAD_ADJUSTMENT;
     }
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03245a7 and eca1d40.

📒 Files selected for processing (1)
  • src/uct/ib/base/ib_iface.c (2 hunks)
🔇 Additional comments (2)
src/uct/ib/base/ib_iface.c (2)

2040-2040: LGTM: Clean local variable initialization.

Creating a local copy of the bcopy overhead allows platform-specific adjustments without modifying the original configuration structure.


2055-2060: LGTM: Correct usage of adjusted overhead for bcopy operations.

The adjusted bcopy_send_overhead is correctly applied only to bcopy operations, which will influence the bcopy vs zcopy threshold calculations as intended. The scope is properly limited by the uct_ep_op_is_bcopy(op) check.
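
A minimal sketch of that gating; only uct_ep_op_is_bcopy and bcopy_send_overhead come from the patch, and the remaining names (notably the zcopy field) are assumptions for illustration:

```c
/* Sketch only: bcopy operations use the adjusted local copy, while other
 * operations keep their original overhead. 'send_overhead->zcopy' is an
 * assumed field name, mirroring the 'send_overhead->bcopy' in the patch. */
overhead = uct_ep_op_is_bcopy(op) ? bcopy_send_overhead :
                                    send_overhead->zcopy;
```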


```c
/* NEW: Adjust overhead for Grace CPU to influence threshold calculations */
if ((ucs_arch_get_cpu_vendor() == UCS_CPU_VENDOR_NVIDIA) &&
    (ucs_arch_get_cpu_model() == UCS_CPU_MODEL_NVIDIA_GRACE)) {
    bcopy_send_overhead += 2 * 1e-6;
}
```
@rakhmets (Contributor) commented Nov 20, 2025
We usually try to make the values reflect the actual characteristics, and this change indicates an increase in the estimated overhead of allocating a TX buffer on Grace CPUs. Is that really the reason? Maybe it is rather the lower overheads associated with zcopy (the CQE, i.e. the estimated overhead of processing a work request completion)?

P.S. It is preferable to make such changes in ucx.conf (as you mentioned in the PR description).
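
For illustration only, the configuration-based alternative could look like the following ucx.conf entry; the variable name below is hypothetical, not an existing UCX parameter:

```
# Hypothetical ucx.conf entry: carry the platform tuning in configuration
# instead of hardcoding it in ib_iface.c. The variable name is invented
# for illustration; only the idea (a 2us bcopy overhead bump) is real.
UCX_IB_BCOPY_SEND_OVERHEAD=2us
```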

Another contributor commented:

+1 for both comments

@roiedanino changed the title from "UCT/IB/BASE: Perf tuning - increase bcopy overhead for grace cpus" to "UCP/PROTO: Perf tuning - zcopy fast-completion latency is over-estimated" on Nov 24, 2025
Comment on lines 172 to 180:

```c
    if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY) {
        if (op_attr_mask & UCP_OP_ATTR_FLAG_FAST_CMPL) {
            /* Waits only for CQE completion, not for the ACK */
            perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c +=
                    tl_perf->send_post_overhead;
        } else {
            /* Send time represents request completion, which in the case
             * of zcopy waits for an ACK from the remote side. */
            perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->latency;
        }
```
@roiedanino (Contributor, Author) commented:

The ACK is sent only in RNDV (which doubles the latency in measurements); we might need to add a UCP_PROTO_COMMON_INIT_FLAG_RNDV flag.
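
A hedged sketch of that proposal, reusing identifiers from the quoted hunk; UCP_PROTO_COMMON_INIT_FLAG_RNDV is the proposed flag and does not exist yet:

```c
/* Sketch: charge the ACK round-trip only for rendezvous, where a
 * protocol-level ACK is actually awaited; eager zcopy completes once the
 * send CQE arrives. UCP_PROTO_COMMON_INIT_FLAG_RNDV is hypothetical. */
if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY) {
    if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_RNDV) {
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->latency;
    } else {
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c +=
                tl_perf->send_post_overhead;
    }
}
```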

@yosefe (Contributor) commented Nov 25, 2025

The original intent of the FAST_COMPLETION protocols is to indicate that the operation is blocking, so the time needed to complete the request on the sender side should also be taken into account; for example, with zcopy there is the additional ACK latency.
The question here is in which flow zcopy is not selected in the 2K-8K range (which application); maybe the FAST_COMPLETION flag should not have been passed there?

@roiedanino (Contributor, Author) commented Nov 25, 2025

@yosefe

> The question here is in which flow zcopy is not selected in the 2K-8K range (which application); maybe the FAST_COMPLETION flag should not have been passed there?

OSU benchmarks, MPI_Send().

Looking at the logs, it seems we are not waiting for any ACKs in fast-completion eager zcopy, so either this is a bug or we only wait for ACKs in RNDV?

@yosefe (Contributor) commented Nov 27, 2025

> Looking at the logs, it seems we are not waiting for any ACKs in fast-completion eager zcopy,

Yes we are, because the send CQE waits for the HW ACK.
BTW, what happens if we reduce the memcpy bandwidth by 2x?

@roiedanino (Contributor, Author) commented:

> > Looking at the logs, it seems we are not waiting for any ACKs in fast-completion eager zcopy,
>
> Yes we are, because the send CQE waits for the HW ACK.
> BTW, what happens if we reduce the memcpy bandwidth by 2x?

I tried that, but it only switches to zcopy when setting BCOPY_BW=2000, which is 30 times slower than reality.
