Skip to content

Conversation

@ofirfarjun7
Copy link
Contributor

@ofirfarjun7 ofirfarjun7 commented Nov 19, 2025

What?

Select best device lanes and keep NICs load balanced.

Why?

Multi process applications wants to spread the load between the NICs.
Today it can be achieved only by setting UCX_NET_DEVICES but it require user to be topo aware.

How?

Use user hint to select NIC and keep load balanced.

Summary by CodeRabbit

  • Refactor
    • Improved lane-selection logic to prefer highest-bandwidth and local/non-RMA paths when beneficial while still considering peer-to-peer (RMA) lanes.
    • Collects and ranks candidate lanes, promotes the best choices into primary slots, and handles cases with no valid lanes more robustly.
    • Added clearer logging/tracing for non-RMA and RMA selection decisions.

@coderabbitai
Copy link

coderabbitai bot commented Nov 19, 2025

Walkthrough

Adds a lane descriptor and comparator; updates lane lookup to collect and sort RMA lanes by bandwidth (tie: device name), track the best non‑RMA lane, choose/promote lanes (prefer non‑RMA at lanes[0] when present), log selections, and return early if no valid lanes.

Changes

Cohort / File(s) Change Summary
Device lane selection logic
src/ucp/core/ucp_device.c
Adds ucp_device_lane_info_t (bandwidth, lane, dev_name) and ucp_device_lane_compare. Updates ucp_device_mem_list_lane_lookup() to capture node_local_id, collect RMA (P2P) lanes into a temp array, sort RMA lanes by descending bandwidth (tie: device name), track best non‑RMA lane and bandwidth, select best RMA using node_local_id % rma_count, promote/place selected lanes into lanes[] (favor non‑RMA at lanes[0]), add logging/tracing, and return early if no valid lanes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Caller as Caller
    participant Lookup as ucp_device_mem_list_lane_lookup
    participant Collector as LaneCollector
    participant Sorter as RMA_Sorter
    participant Selector as FinalSelector

    Caller->>Lookup: request lanes for mem-list
    Note right of Lookup: iterate candidates, capture node_local_id
    Lookup->>Collector: collect RMA entries (bw, lane, dev_name)
    Lookup->>Collector: track best non‑RMA lane & bw
    alt RMA entries exist
        Collector->>Sorter: sort by bw desc (tie: dev_name)
        Sorter-->>Selector: sorted RMA list
        Selector->>Lookup: choose index = node_local_id % rma_count
    else
        Collector-->>Selector: no RMA lanes
    end
    alt non‑RMA exists
        Selector->>Lookup: place non‑RMA at lanes[0]
        Selector->>Lookup: place chosen RMA at lanes[1] or promote if bw higher
        Note right of Lookup: trace/log non‑RMA and RMA choices
    else
        Selector->>Lookup: use chosen RMA as lanes[0]
        Note right of Lookup: trace/log RMA choice
    end
    alt no valid lanes
        Lookup-->>Caller: return early (no lanes)
    else
        Lookup-->>Caller: return selected lanes[]
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Single-file change with new struct, comparator, sorting, branching, and lane index manipulation.
  • Pay attention to:
    • Lifetime/ownership of ucp_device_lane_info_t.dev_name.
    • Comparator behavior on equal bandwidths.
    • Off‑by‑one or overwrite risks when promoting/shifting lanes[].
    • Consistent logging/tracing for all code paths including early return.

Poem

🐰 I hop through lanes both quick and slow,

I measure bytes where currents flow.
I sort the P2P, then sniff the best,
I nudge the winner forward, tidy the rest.
Hooray — the chosen lanes now go.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Select device lanes' directly aligns with the main change: implementing device lane selection logic with load balancing for NICs.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/ucp/core/ucp_device.c (2)

283-302: Lane descriptor and comparator look correct; consider NULL‑safety for dev_name

The struct and comparator logic (descending bandwidth, then device name) are straightforward and match the intended ordering. One minor robustness detail: ucp_device_lane_compare unconditionally calls strcmp(lane_a->dev_name, lane_b->dev_name). If any transport ever leaves dev_name as NULL, this will crash inside qsort.

If the contract guarantees dev_name is never NULL, it would be good to document or assert it here:

-    return strcmp(lane_a->dev_name, lane_b->dev_name);
+    ucs_assert(lane_a->dev_name != NULL);
+    ucs_assert(lane_b->dev_name != NULL);
+    return strcmp(lane_a->dev_name, lane_b->dev_name);

Otherwise, consider falling back to an empty string or a stable placeholder when dev_name == NULL.


310-320: Lane selection / load‑balancing logic is sound; a couple of clarity tweaks would help

The updated ucp_device_mem_list_lane_lookup logic correctly:

  • Tracks the best “non‑RMA” lane separately (best_non_rma_lane / best_non_rma_bw).
  • Accumulates other candidate lanes in best_lanes[], sorts them by bandwidth+dev_name, and picks one by node_local_id % num_valid_lanes for load balancing.
  • Prefers the RMA lane over the non‑RMA lane when its bandwidth is higher, while still filling at most two entries in lanes[].

A few small refinements could improve readability and diagnostics without changing behavior:

  1. Make is_non_rma strictly boolean for better traces

Currently:

is_non_rma = (ucp_ep_md_attr(ep, lane)->flags & UCT_MD_FLAG_INVALIDATE_RMA);
...
" ... non_rma=%d", ..., is_non_rma);

is_non_rma will log the raw bitmask, not 0/1. Making it boolean keeps the trace easier to read:

-    is_non_rma = (ucp_ep_md_attr(ep, lane)->flags &
-                  UCT_MD_FLAG_INVALIDATE_RMA);
+    is_non_rma = !!(ucp_ep_md_attr(ep, lane)->flags &
+                    UCT_MD_FLAG_INVALIDATE_RMA);
  1. Avoid relying on best_non_rma_bw == -1.0 as a sentinel

The final selection uses best_lanes[lane].bandwidth > best_non_rma_bw, where best_non_rma_bw is initialized to -1.0. It works, but the coupling between a magic value and control flow is not obvious. A slightly clearer variant would key off the presence of best_non_rma_lane instead:

-    if (best_lanes[lane].bandwidth > best_non_rma_bw) {
-        lanes[1] = lanes[0];
-        lanes[0] = best_lanes[lane].lane;
-    } else {
-        lanes[1] = best_lanes[lane].lane;
-    }
+    if (best_non_rma_lane == UCP_NULL_LANE ||
+        best_lanes[lane].bandwidth > best_non_rma_bw) {
+        lanes[1] = best_non_rma_lane;
+        lanes[0] = best_lanes[lane].lane;
+    } else {
+        lanes[1] = best_lanes[lane].lane;
+    }

Functionally equivalent, but easier to reason about.

  1. Confirm the “non‑RMA” criterion matches intent

You’re classifying lanes as “non‑RMA” based on UCT_MD_FLAG_INVALIDATE_RMA:

is_non_rma = !!(ucp_ep_md_attr(ep, lane)->flags & UCT_MD_FLAG_INVALIDATE_RMA);

This looks intentional but is slightly non‑obvious naming‑wise (a flag mentioning RMA being used to define “non‑RMA”). It might be worth double‑checking that this flag set is exactly the set of lanes you want to prioritize as lanes[0] before RMA lanes, or at least adding a brief comment tying the flag to the “non‑RMA” concept.

Also applies to: 361-405

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4209ddc and 3b2a3c2.

📒 Files selected for processing (1)
  • src/ucp/core/ucp_device.c (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: UCX PR (Static_check Static checks)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)

@ofirfarjun7 ofirfarjun7 force-pushed the topic/select-gpu-rdma-lane branch from 8caa8b4 to 41ffa64 Compare November 19, 2025 15:47
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/ucp/core/ucp_device.c (1)

401-406: Consider explicit null-lane check for clarity.

While the current logic is correct (comparing against best_non_rma_bw = -1.0 when no non-RMA lane exists), the code would be clearer with an explicit check.

Apply this diff for improved clarity:

-        if (best_lanes[lane].bandwidth > best_non_rma_bw) {
+        if ((best_non_rma_lane == UCP_NULL_LANE) ||
+            (best_lanes[lane].bandwidth > best_non_rma_bw)) {
             lanes[1] = lanes[0];
             lanes[0] = best_lanes[lane].lane;
         } else {
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3b2a3c2 and 7939f57.

📒 Files selected for processing (1)
  • src/ucp/core/ucp_device.c (2 hunks)
🔇 Additional comments (5)
src/ucp/core/ucp_device.c (5)

283-287: LGTM! Clean descriptor struct.

The struct is well-defined for collecting lane information. The const char *dev_name pointer is appropriate for referencing stable transport resource names.


361-379: Good separation of non-RMA and RMA lanes.

The logic correctly distinguishes between non-RMA lanes (can use copy operations) and RMA lanes (require memory registration). Tracking the best non-RMA lane while collecting all RMA lanes for load balancing is the right approach.


392-408: Excellent load balancing implementation!

The load balancing logic is well-designed:

  • Sorting ensures deterministic ordering across all processes
  • The modulo operation (node_local_id % num_valid_lanes) distributes different processes to different NICs
  • Lane promotion based on bandwidth comparison optimizes for performance

The logic correctly handles all scenarios (no non-RMA, no RMA, both present).


310-310: No issues found. The node_local_id field is always initialized during context creation with a default value of 0 (see ucp_context.c line 2065), and can be overridden via environment variables (lines 2177-2179). By the time an endpoint accesses this field in line 310, the context is fully initialized.


289-302: <!-- [scratchpad_start]
[task review_file_1/1 review_segment_2/3]
Continue searching for ucp_ep_get_tl_rsc definition and dev_name initialization.

[observations]

  • Previous search found usages but not definition
  • usages show: ucp_ep_get_tl_rsc(ep, lane)->dev_name at line 377
  • usages show: ucp_ep_get_tl_rsc(ep, lane)->sys_device at line 344
  • This suggests return type has fields: dev_name, sys_device
  • Likely defined in header file (not .c file)

[analysis]

  • Need to search header files (.h) for the function definition and return type
  • Structure might be uct_tl_resource_desc_t or similar based on macro use

[pending]

  • Find function signature in header files
  • Understand return type and whether dev_name is guaranteed non-NULL
  • Check if return value can be NULL

[actions]
Search for:

  1. ucp_ep_get_tl_rsc in header files (.h)
  2. uct_tl_resource_desc_t or similar structure definition
  3. Check if dev_name field is documented or guaranteed


static int ucp_device_lane_compare(const void *a, const void *b)
{
const ucp_device_lane_info_t *lane_a = (const ucp_device_lane_info_t*)a;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

normally casting from void * is redundant

ucp_lane_index_t lanes[UCP_DEVICE_MEM_LIST_MAX_EPS])
{
double best_bw[UCP_DEVICE_MEM_LIST_MAX_EPS] = {-1., -1.};
unsigned long node_local_id = ep->worker->context->config.node_local_id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I run format script and it kept it as is.

/* Sort RMA lanes */
qsort(best_lanes, num_valid_lanes, sizeof(ucp_device_lane_info_t),
ucp_device_lane_compare);
lane = node_local_id % num_valid_lanes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra spaces

bandwidth = ucp_worker_iface_bandwidth(ep->worker,
ucp_ep_get_rsc_index(ep, lane));

is_rma = !!(ucp_ep_md_attr(ep, lane)->flags & UCT_MD_FLAG_NEED_MEMH);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. not sure we need the separation between RMA and non-RMA lanes
  2. we could just save the indexes of the valid lanes and then sort, and not save all lanes properties in the array. Use ucs_qsort_r to pass context to get the rest of the info needed for comparison.
  3. can use ucs_array

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants