UCP/PROTO: Perf tuning - zcopy fast-completion latency is over-estimated #11016
base: master
Conversation
Signed-off-by: Roie Danino <[email protected]>
Walkthrough

A Grace CPU-specific performance overhead adjustment is introduced in the InfiniBand interface layer. When a Grace CPU is detected, the bcopy send overhead used in performance estimation is incremented by 2 microseconds to account for platform-specific characteristics.
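As a rough, standalone illustration of the pattern the walkthrough describes (the actual diff hunk from src/uct/ib/base/ib_iface.c is quoted further down), the detection and adjustment could be sketched like this, assuming the UCX headers that declare the CPU vendor/model helpers are on the include path:

```c
#include <ucs/arch/cpu.h> /* declares ucs_arch_get_cpu_vendor()/..._model() */

/* Illustrative helper, not the actual ib_iface.c code: bump the estimated
 * bcopy send overhead by 2us when running on an NVIDIA Grace CPU, so that
 * the bcopy/zcopy threshold calculation shifts accordingly. */
static double adjust_bcopy_send_overhead(double bcopy_send_overhead)
{
    if ((ucs_arch_get_cpu_vendor() == UCS_CPU_VENDOR_NVIDIA) &&
        (ucs_arch_get_cpu_model() == UCS_CPU_MODEL_NVIDIA_GRACE)) {
        bcopy_send_overhead += 2e-6; /* +2 microseconds, as in the PR */
    }
    return bcopy_send_overhead;
}
```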
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/uct/ib/base/ib_iface.c (1)
2050-2053: CPU detection API verified; optional refactor suggestion remains valid.

The CPU detection functions (ucs_arch_get_cpu_vendor(), ucs_arch_get_cpu_model()) and constants (UCS_CPU_VENDOR_NVIDIA, UCS_CPU_MODEL_NVIDIA_GRACE) are properly defined across all supported architectures and available in the codebase. The Grace CPU detection pattern at line 2051 follows the same approach already used elsewhere (e.g., src/ucs/arch/aarch64/cpu.h:147-149).

For improved maintainability, consider extracting the magic number 2 * 1e-6 into a named constant:

+#define UCT_IB_GRACE_CPU_BCOPY_OVERHEAD_ADJUSTMENT 2e-6
+
 ucs_status_t uct_ib_iface_estimate_perf(uct_iface_h iface,
                                         uct_perf_attr_t *perf_attr)
 {
     ...
     /* NEW: Adjust overhead for Grace CPU to influence threshold calculations */
     if ((ucs_arch_get_cpu_vendor() == UCS_CPU_VENDOR_NVIDIA) &&
         (ucs_arch_get_cpu_model() == UCS_CPU_MODEL_NVIDIA_GRACE)) {
-        bcopy_send_overhead += 2 * 1e-6;
+        bcopy_send_overhead += UCT_IB_GRACE_CPU_BCOPY_OVERHEAD_ADJUSTMENT;
     }
📜 Review details
📒 Files selected for processing (1)
src/uct/ib/base/ib_iface.c (2 hunks)
🔇 Additional comments (2)
src/uct/ib/base/ib_iface.c (2)
2040-2040: LGTM: Clean local variable initialization.

Creating a local copy of the bcopy overhead allows platform-specific adjustments without modifying the original configuration structure.
2055-2060: LGTM: Correct usage of adjusted overhead for bcopy operations.

The adjusted bcopy_send_overhead is correctly applied only to bcopy operations, which will influence the bcopy vs zcopy threshold calculations as intended. The scope is properly limited by the uct_ep_op_is_bcopy(op) check.
src/uct/ib/base/ib_iface.c
Outdated
    /* NEW: Adjust overhead for Grace CPU to influence threshold calculations */
    if ((ucs_arch_get_cpu_vendor() == UCS_CPU_VENDOR_NVIDIA) &&
        (ucs_arch_get_cpu_model() == UCS_CPU_MODEL_NVIDIA_GRACE)) {
        bcopy_send_overhead += 2 * 1e-6;
We usually try to make the values reflect the actual hardware characteristics, and this change implies an increased estimated overhead for allocating a TX buffer on the Grace CPU. Is that really the cause?
Maybe it is rather due to lower overheads associated with zcopy (CQE handling, the estimated overhead of processing a work request completion)?
P.S. It is preferable to make such changes in ucx.conf (as you mentioned in the PR description).
+1 for both comments
Signed-off-by: Roie Danino <[email protected]>
…ing for ack Signed-off-by: Roie Danino <[email protected]>
src/ucp/proto/proto_init.c
Outdated
if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY) {
    if (op_attr_mask & UCP_OP_ATTR_FLAG_FAST_CMPL) {
        /* Waits only for cqe completion not for ACK */
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->send_post_overhead;
    } else {
        /* Send time is representing request completion, which in case of zcopy
           waits for ACK from remote side. */
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->latency;
    }
The ACK is sent only in RNDV (which doubles the latency in measurements); we might need to add a UCP_PROTO_COMMON_INIT_FLAG_RNDV flag.
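A hypothetical sketch of that suggestion, reusing the names from the excerpt above; the UCP_PROTO_COMMON_INIT_FLAG_RNDV flag does not exist in the current code and is assumed here only for illustration, so this is not the actual proto_init.c change:

```c
/* Hypothetical: charge a full network latency only when the protocol is an
 * RNDV zcopy that really waits for the remote ACK; eager zcopy and
 * fast-completion requests complete on the local CQE. The RNDV flag below
 * is assumed, not an existing UCX flag. */
if (params->flags & UCP_PROTO_COMMON_INIT_FLAG_SEND_ZCOPY) {
    if (!(params->flags & UCP_PROTO_COMMON_INIT_FLAG_RNDV) ||
        (op_attr_mask & UCP_OP_ATTR_FLAG_FAST_CMPL)) {
        /* Completion is signaled by the local CQE only */
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->send_post_overhead;
    } else {
        /* RNDV request completion waits for the remote ACK */
        perf_factors[UCP_PROTO_PERF_FACTOR_LATENCY].c += tl_perf->latency;
    }
}
```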
The original intent of FAST_COMPLETION protocols is to indicate how long the operation is blocking, so the time needed to complete the request on the sender side should also be taken into account. For example, with zcopy, there is the additional ACK latency.
OSU benchmarks, MPI_Send(). Looking at the logs, it seems we are not waiting for any ACKs in fast-completion eager zcopy, so either this is a bug or we only wait for ACKs in RNDV?
…y just in rndv Signed-off-by: Roie Danino <[email protected]>
I tried that, but it only switches to zcopy when setting BCOPY_BW=2000, which is 30 times slower than reality |
What?
Fixing the latency estimation for the zcopy fast-completion protocol: fast completion does not wait for the ACK, only for the CQE.
Why?
Currently, zcopy latency is overestimated, causing bcopy to be selected in the 2K-8K message range, even though zcopy performs better there.
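To make that concrete, here is a small back-of-the-envelope model; all constants are assumptions chosen only to illustrate the effect, not measured UCX parameters, and the cost formulas are heavily simplified compared to the real protocol-selection logic:

```c
/* Toy model: estimated single-message cost of bcopy vs. zcopy eager sends.
 * "old" charges zcopy a full network latency (the assumed ACK wait); "fixed"
 * charges only the send-post overhead, as this PR proposes. All numbers are
 * illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double bcopy_bw    = 10e9;   /* assumed memcpy bandwidth [B/s] */
    const double send_ovh    = 0.3e-6; /* assumed bcopy send overhead [s] */
    const double post_ovh    = 0.2e-6; /* assumed zcopy send-post overhead [s] */
    const double reg_ovh     = 0.5e-6; /* assumed zcopy registration cost [s] */
    const double net_latency = 2.0e-6; /* assumed network latency (ACK wait) [s] */

    for (size_t size = 1024; size <= 16384; size *= 2) {
        double bcopy_est = send_ovh + size / bcopy_bw;
        double zcopy_old = reg_ovh + post_ovh + net_latency; /* over-estimated */
        double zcopy_new = reg_ovh + post_ovh;               /* CQE only */

        printf("%6zu B: bcopy %.2f us, zcopy(old) %.2f us, zcopy(fixed) %.2f us\n",
               size, bcopy_est * 1e6, zcopy_old * 1e6, zcopy_new * 1e6);
    }
    return 0;
}
```

With these assumed numbers the crossover moves from well above 16 KB (old estimate) down to roughly 4 KB (fixed estimate), which matches the 2K-8K window described above.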