@zzhang37 zzhang37 commented Nov 19, 2025

What?

This PR adds Intel Gaudi GDR support to UCX, so that UCX can access Gaudi memory through the host NIC via Gaudi GDR using the standard UCP API.

Why?

Extends UCX support to Intel Gaudi devices.

How?

Adds Gaudi memory support in USM; the rest is new Gaudi GDR code.

Summary by CodeRabbit

  • New Features

    • Added HabanaLabs Gaudi support: device discovery, topology-aware mapping to NICs, and a Gaudi transport
    • Exposed Gaudi memory type with DMABUF-based memory handling and device query support
    • Extended GPU Direct RDMA detection to include Gaudi devices
    • Added pkg-config and installation artifacts for the Gaudi module
  • Chores

    • Build/config updated for optional, conditional Gaudi module integration


coderabbitai bot commented Nov 19, 2025

Walkthrough

Adds HabanaLabs Gaudi support: autoconf checks and build integration, new GAUDI memory type, Gaudi topology provider, Gaudi base/SCAL utilities, a Gaudi GDR MD and iface, pkg-config template and Debian install entries, plus IB MD extensions for Gaudi dmabuf detection.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Autoconf & configure fragments: config/m4/gaudi.m4, src/uct/configure.m4, src/uct/gaudi/configure.m4 | New UCX_CHECK_GAUDI macro and configure logic to detect Gaudi, compute/apply GAUDI_CPPFLAGS/LDFLAGS/LIBS, define HAVE_GAUDI conditional, and register generation of gaudi Makefile and pkg-config template. |
| Build system / Makefiles: src/uct/Makefile.am, src/uct/gaudi/Makefile.am, src/ucs/Makefile.am, debian/ucx-gaudi.install | Integrates gaudi into SUBDIRS, conditionally includes Gaudi topology headers/sources, declares libuct_gaudi.la and module build rules, and adds Debian install targets for Gaudi libraries. |
| Memory type additions: src/ucs/memory/memory_type.h, src/ucs/memory/memory_type.c | Adds UCS_MEMORY_TYPE_GAUDI enum entry and corresponding name/description entries ("gaudi", "HabanaLabs Gaudi memory"). |
| Gaudi topology provider: src/ucs/sys/topo/gaudi/topo.h, src/ucs/sys/topo/gaudi/topo.c | New Gaudi topology provider with lazy init, sysfs/device enumeration, PCIe/NUMA distance modeling, connection matrix and balanced Gaudi→HNIC assignment. Exposes init/cleanup/get-index/find-best-connection APIs. |
| Gaudi base utilities & SCAL API: src/uct/gaudi/base/scal.h, src/uct/gaudi/base/gaudi_base.h, src/uct/gaudi/base/gaudi_base.c | New SCAL types/APIs and Gaudi base helpers to obtain/close device FDs, close dmabuf FDs, resolve sys device, read pool/device info and export dmabuf FD, and query/register Gaudi device resources. |
| Gaudi GDR MD & headers: src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.h, src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c | New GAUDI GDR MD component: resource query, md_open/md_close, mem_query/memory-type handling, dmabuf duplication/management, MD struct/config types, and component registration (uct_gaudi_gdr_component). |
| Gaudi GDR iface: src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.h, src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.c | New TL iface type uct_gaudi_gdr_iface_t, ctor/dtor, iface query, ops stubs, class/new function, and UCT_TL_DEFINE registration for gaudi_gdr. |
| Pkg-config template: src/uct/gaudi/ucx-gaudi.pc.in | Adds pkg-config template for the GAUDI UCT module. |
| IB MD integration: src/uct/ib/base/ib_md.c | Under #ifdef HAVE_GAUDI, extends GPU-direct RDMA detection to include Gaudi device paths (e.g., /dev/accel/accel0, /dev/hl0) when checking dmabuf/GDR support. |

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant UCX as UCX Core
    participant GaudiMD as Gaudi GDR MD
    participant GaudiBase as Gaudi Base / SCAL
    participant Topo as Gaudi Topology
    participant Sysfs as /sys & /dev

    App->>UCX: init
    UCX->>GaudiMD: query_md_resources()
    GaudiMD->>GaudiBase: uct_gaudi_base_query_devices()
    GaudiBase->>Sysfs: open /dev/accel*, hlthunk, read PCI BDF
    Sysfs-->>GaudiBase: fd, PCI info
    GaudiBase-->>GaudiMD: device resource

    App->>UCX: open md
    UCX->>GaudiMD: uct_gaudi_md_open()
    GaudiMD->>GaudiBase: get_fd / get_info -> dmabuf fd, base, size
    GaudiBase-->>GaudiMD: dmabuf_fd, addresses
    GaudiMD-->>UCX: md handle

    UCX->>Topo: ucs_gaudi_find_best_connection(name)
    Topo->>Sysfs: enumerate accel/hnic, build matrix (lazy)
    Topo-->>UCX: best HNIC + port / UCS_ERR_NO_ELEM

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Focus areas:
    • src/ucs/sys/topo/gaudi/topo.c — sysfs parsing, PCI/NUMA distance modeling, concurrency and lazy-init error paths.
    • src/uct/gaudi/** — md, iface, base: dmabuf handling, FD lifetimes, address/size validation, SCAL interactions, and component registration.
    • Build/config changes — UCX_CHECK_GAUDI, correct propagation of GAUDI flags, Makefile/pc generation and Debian install entries.
    • src/uct/ib/base/ib_md.c — ensure Gaudi checks are correctly conditional and do not regress existing GPU-D behavior.

Poem

🐇 I hopped through sysfs tunnels, sniffed accel and thread,
dmabufs hummed softly where Gaudi memories spread.
I paired hops to NICs, balanced bridges with care,
now packets bound lightly through silicon air. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.10% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the primary change: adding Intel Gaudi GDR support to UCX. It is concise, specific, and directly reflects the main objective of the changeset. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (7)
src/uct/ib/base/ib_md.c (1)

1344-1348: Gaudi GPUDirect detection looks correct; comment wording is slightly misleading

The new #ifdef HAVE_GAUDI block reuses uct_ib_check_gpudirect_driver consistently with the existing CUDA/ROCm checks and safely gates UCS_MEMORY_TYPE_GAUDI registration on the presence of Gaudi device nodes. This integrates cleanly into the later GPUDirect enablement check.

Minor nit: the comment says “Gaudi DMABuf support”, but this block is only checking for Gaudi device/driver presence; actual dmabuf capability is still probed by uct_ib_md_check_dmabuf(md) below. You might want to rephrase the comment to avoid confusion, e.g.:

-        /* Check for HabanaLabs Gaudi DMABuf support */
+        /* Check for HabanaLabs Gaudi driver (to allow Gaudi GDR via dmabuf) */
src/uct/gaudi/ucx-gaudi.pc.in (1)

7-11: Populate Libs/Libs.private to make the pkg-config entry useful

Right now Libs: and Libs.private: are empty, so pkg-config --libs @PACKAGE@-gaudi will not return any link flags. It would be better to mirror the patterns used by other UCX TL .pc.in files (e.g., add the appropriate -L/-l flags and/or Requires:), so consumers can link Gaudi support just by depending on this pkg-config module.

Please cross-check with the existing UCX TL .pc.in templates in this repo to ensure the Gaudi one follows the same conventions.

src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.h (1)

6-27: Consider tightening include guard name and normalizing field naming

Functionally this header looks fine, but there are a couple of style nits:

  • The include guard GAUDI_MD_H is quite generic; most UCX headers use a more specific name (e.g., UCT_GAUDI_MD_H) to avoid collisions.
  • The uct_gaudi_md_t members mostly use snake_case, but totalSize stands out in camelCase; renaming to total_size would better match the surrounding code.

These are small cleanups and not blocking, but aligning with existing UCX style will help long-term maintainability.

If you decide to rename totalSize, please update all its uses in the corresponding .c files and any other references.

src/ucs/sys/topo/gaudi/topo.h (1)

7-64: API surface looks good; add <stdint.h> for robustness

The Gaudi topology API (init/cleanup, module-id lookup, and best-connection selection) is well documented and matches the implementation behavior in topo.c, including error reporting and lazy initialization semantics.

One minor robustness improvement: this header uses uint32_t but does not include <stdint.h> directly, instead relying on transitive includes from other headers. Adding an explicit include would avoid surprises if those headers ever change:

#include <stdint.h>
#include <ucs/sys/topo/base/topo.h>
#include <ucs/type/status.h>

Please confirm this change compiles cleanly across all supported platforms/toolchains, in case any rely on specific include ordering.

src/uct/gaudi/base/gaudi_base.c (2)

23-31: Clarify synDeviceGetInfo(-1, ...) vs device_id semantics

uct_gaudi_base_get_fd() ignores device_id when synDeviceGetInfo(-1, &deviceInfo) succeeds and always returns deviceInfo.fd. If multiple Gaudi devices are present, that may not correspond to the requested device_id unless the API guarantees that -1 already reflects the intended device and device_id is only for the hlthunk fallback. Consider either passing device_id to synDeviceGetInfo (if supported) or documenting the assumption that device_id is used only on the hlthunk path.


51-95: Add a defensive check before subtracting hbm_pool_start - addr

In uct_gaudi_base_get_info(), offset is computed as hbm_pool_start - addr with both operands uint64_t. If SCAL ever reports device_base_address < device_base_allocated_address, this will underflow and produce a huge offset passed into hlthunk_device_mapped_memory_export_dmabuf_fd(). It would be safer to assert or explicitly check the ordering and fail early, e.g. returning an error if hbm_pool_start < addr, rather than relying on implicit SCAL guarantees.
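
The defensive check suggested above can be sketched as a small helper. This is only an illustration: `gaudi_checked_offset` and its parameter names are hypothetical and do not appear in the PR, and the real code would return a `ucs_status_t` rather than a plain `int`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the PR): computes pool_start - base
 * without letting the unsigned subtraction wrap around.
 * Returns 0 and stores the offset on success, -1 if pool_start < base.
 */
int gaudi_checked_offset(uint64_t pool_start, uint64_t base,
                         uint64_t *offset_p)
{
    assert(offset_p != NULL);

    if (pool_start < base) {
        /* Ordering violated: fail early instead of producing a huge offset */
        return -1;
    }

    *offset_p = pool_start - base;
    return 0;
}
```

On failure the output argument is left untouched, so a caller can log both addresses and bail out before any dmabuf export is attempted.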

src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.c (1)

16-27: Double-check max_num_eps = 0 and near-zero bandwidth values

uct_gaudi_gdr_iface_query() overrides the base max_num_eps with 0 and sets an extremely small non-zero dedicated bandwidth. If 0 is not explicitly treated as “unlimited” in UCT, this could be interpreted as “no endpoints supported”. If the TL is only a control/auxiliary iface, documenting this choice or simply preserving iface->super.config.max_num_eps would make the behavior clearer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 03245a7 and 0600a8e.

📒 Files selected for processing (20)
  • config/m4/gaudi.m4 (1 hunks)
  • debian/ucx-gaudi.install (1 hunks)
  • src/ucs/Makefile.am (2 hunks)
  • src/ucs/memory/memory_type.c (2 hunks)
  • src/ucs/memory/memory_type.h (1 hunks)
  • src/ucs/sys/topo/gaudi/topo.c (1 hunks)
  • src/ucs/sys/topo/gaudi/topo.h (1 hunks)
  • src/uct/Makefile.am (1 hunks)
  • src/uct/configure.m4 (1 hunks)
  • src/uct/gaudi/Makefile.am (1 hunks)
  • src/uct/gaudi/base/gaudi_base.c (1 hunks)
  • src/uct/gaudi/base/gaudi_base.h (1 hunks)
  • src/uct/gaudi/base/scal.h (1 hunks)
  • src/uct/gaudi/configure.m4 (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.c (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.h (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.h (1 hunks)
  • src/uct/gaudi/ucx-gaudi.pc.in (1 hunks)
  • src/uct/ib/base/ib_md.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
src/uct/gaudi/base/gaudi_base.h (1)
src/uct/gaudi/base/gaudi_base.c (4)
  • uct_gaudi_base_get_fd (23-31)
  • uct_gaudi_base_get_sysdev (33-49)
  • uct_gaudi_base_get_info (51-95)
  • uct_gaudi_base_query_devices (97-112)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.c (3)
src/uct/base/uct_iface.c (4)
  • uct_base_iface_query (509-515)
  • uct_base_iface_is_reachable (343-355)
  • uct_iface_base_query_v2 (573-581)
  • uct_base_iface_t (721-724)
src/ucs/sys/stubs.c (5)
  • ucs_empty_function (15-17)
  • ucs_empty_function_return_success (49-52)
  • ucs_empty_function_return_unsupported (54-57)
  • ucs_empty_function_return_zero (19-22)
  • ucs_empty_function_return_zero_int (29-32)
src/uct/gaudi/base/gaudi_base.c (1)
  • uct_gaudi_base_query_devices (97-112)
src/ucs/sys/topo/gaudi/topo.h (1)
src/ucs/sys/topo/gaudi/topo.c (4)
  • ucs_gaudi_topo_init (1716-1753)
  • ucs_gaudi_topo_cleanup (1852-1899)
  • ucs_gaudi_get_index_from_module_id (304-342)
  • ucs_gaudi_find_best_connection (1513-1564)
src/uct/gaudi/base/gaudi_base.c (2)
src/ucs/sys/topo/base/topo.c (1)
  • ucs_topo_find_device_by_bdf_name (737-759)
src/uct/base/uct_iface.c (1)
  • uct_single_device_resource (550-571)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (5)
src/uct/base/uct_md.c (3)
  • uct_md_base_md_query (505-523)
  • uct_md_query_single_md_resource (144-162)
  • uct_md_stub_rkey_unpack (173-181)
src/ucs/debug/memtrack.c (2)
  • ucs_free (368-372)
  • ucs_malloc (328-334)
src/ucs/memory/memtype_cache.c (1)
  • ucs_memtype_cache_update (266-277)
src/ucs/sys/stubs.c (3)
  • ucs_empty_function_return_unsupported (54-57)
  • ucs_empty_function_return_success (49-52)
  • ucs_empty_function (15-17)
src/uct/gaudi/base/gaudi_base.c (3)
  • uct_gaudi_base_get_fd (23-31)
  • uct_gaudi_base_get_info (51-95)
  • uct_gaudi_base_get_sysdev (33-49)
src/ucs/sys/topo/gaudi/topo.c (6)
src/ucs/sys/topo/base/topo.c (6)
  • ucs_topo_get_device_bus_id (355-364)
  • ucs_topo_num_devices (211-220)
  • ucs_topo_sys_device_get_name (784-801)
  • ucs_topo_find_device_by_bdf_name (737-759)
  • ucs_topo_sys_device_set_name (761-782)
  • ucs_topo_sys_device_get_numa_node (803-820)
src/ucs/sys/string.c (5)
  • ucs_strncpy_safe (235-249)
  • ucs_snprintf_safe (226-233)
  • ucs_string_alloc_path_buffer (454-465)
  • ucs_path_get_common_parent (373-381)
  • ucs_path_calc_distance (383-389)
src/ucs/debug/memtrack.c (3)
  • ucs_free (368-372)
  • ucs_calloc (336-342)
  • ucs_malloc (328-334)
src/ucs/sys/sys.c (3)
  • ucs_sys_read_sysfs_file (1634-1654)
  • ucs_make_affinity_str (1312-1344)
  • ucs_sys_pthread_getaffinity (1376-1384)
src/ucs/memory/numa.c (4)
  • ucs_numa_num_configured_cpus (110-122)
  • ucs_numa_node_of_cpu (124-152)
  • ucs_numa_num_configured_nodes (95-108)
  • ucs_numa_distance (232-254)
src/ucs/type/spinlock.h (3)
  • ucs_spin_lock (81-84)
  • ucs_spin_unlock (127-130)
  • ucs_spinlock_init (43-59)
🪛 Clang (14.0.6)
src/uct/gaudi/base/scal.h

[error] 4-4: 'stdint.h' file not found

(clang-diagnostic-error)

src/uct/gaudi/base/gaudi_base.h

[error] 9-9: 'uct/base/uct_iface.h' file not found

(clang-diagnostic-error)

src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.h

[error] 9-9: 'uct/base/uct_iface.h' file not found

(clang-diagnostic-error)

src/ucs/sys/topo/gaudi/topo.h

[error] 10-10: 'ucs/sys/topo/base/topo.h' file not found

(clang-diagnostic-error)

src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c

[error] 10-10: 'ucs/memory/memtype_cache.h' file not found

(clang-diagnostic-error)

src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.h

[error] 9-9: 'uct/base/uct_md.h' file not found

(clang-diagnostic-error)

🔇 Additional comments (12)
debian/ucx-gaudi.install (1)

1-2: Install entries look correct; please confirm paths match built GAUDI artifacts

The two entries align with the new GAUDI perf-test and UCT libraries; just double‑check that the actual .so names and install directories produced by the build (including any multi‑arch directory like ...-linux-gnu) match these paths and the rest of the Debian packaging for UCX.

src/ucs/Makefile.am (2)

93-96: GAUDI topo header conditional looks good; verify path and macro wiring

Conditionally adding sys/topo/gaudi/topo.h to nobase_dist_libucs_la_HEADERS under HAVE_GAUDI is consistent with the new topology support and keeps GAUDI headers out when the feature is disabled. Please just verify:

  • The header really lives at src/ucs/sys/topo/gaudi/topo.h, and
  • HAVE_GAUDI is the same Automake conditional defined in your new UCX_CHECK_GAUDI logic (so this block is never evaluated before the conditional is set).

228-231: GAUDI topo source inclusion is correct and properly guarded

Adding sys/topo/gaudi/topo.c to libucs_la_SOURCES under if HAVE_GAUDI correctly isolates GAUDI‑specific code from non‑GAUDI builds. The indentation tweak for memcpy_thunderx2.S is cosmetic and has no functional effect.

Also applies to: 235-235

config/m4/gaudi.m4 (1)

58-60: Ensure UCX_CHECK_GAUDI is invoked so HAVE_GAUDI conditional exists for Makefiles

You define AM_CONDITIONAL([HAVE_GAUDI], [test "x$gaudi_happy" != xno]) here and then use if HAVE_GAUDI in multiple Makefile.ams. Please verify that UCX_CHECK_GAUDI is actually invoked from configure.ac (or a top‑level UCX config macro) before AC_OUTPUT, otherwise Automake will complain that conditional HAVE_GAUDI is not defined.

src/uct/Makefile.am (1)

10-10: Including gaudi in UCT SUBDIRS is appropriate; confirm gaudi subdir is no‑op when disabled

Adding gaudi alongside cuda/rocm/... ensures the GAUDI transport is built as part of the normal UCT build. Please just confirm that src/uct/gaudi/Makefile.am fully guards its targets with HAVE_GAUDI (or similar), so that when GAUDI is not available the gaudi subdir effectively becomes a no‑op and doesn’t introduce spurious build failures.

src/uct/configure.m4 (1)

7-15: GAUDI UCT configure fragment inclusion is consistent with other transports

Including src/uct/gaudi/configure.m4 alongside the other transport fragments is the right way to hook GAUDI module configuration into UCT. Please verify that the gaudi fragment:

  • Appends gaudi to uct_modules when enabled, and
  • Invokes UCX_CHECK_GAUDI (or otherwise ensures HAVE_GAUDI/GAUDI_* flags are set)

so that the new transport is both discoverable and correctly conditioned.

src/ucs/memory/memory_type.h (1)

38-51: GAUDI memory type addition looks consistent

Adding UCS_MEMORY_TYPE_GAUDI before UCS_MEMORY_TYPE_LAST matches the existing enum style and is correctly positioned for use with the ucs_memory_type_for_each macro.

Please ensure all switch statements over ucs_memory_type_t in the rest of the codebase are updated to handle UCS_MEMORY_TYPE_GAUDI where appropriate (e.g., mapping to MDs/UCTs, logging, config parsing).

src/ucs/memory/memory_type.c (1)

17-44: Names/descriptions for GAUDI are wired correctly

The ucs_memory_type_names and ucs_memory_type_descs entries for UCS_MEMORY_TYPE_GAUDI are consistent with the enum and the existing naming scheme; sentinel entries for UCS_MEMORY_TYPE_LAST remain intact.

If there are helper routines that map string names to ucs_memory_type_t (e.g., config parsing), please confirm they also recognize "gaudi" so configuration and logging remain consistent.

src/uct/gaudi/gaudi_gdr/gaudi_gdr_iface.h (1)

6-15: Gaudi GDR iface definition is minimal and consistent

The TL name macro and uct_gaudi_gdr_iface_t wrapper over uct_base_iface_t follow existing UCT conventions; include guard and header dependency look correct. The reported uct/base/uct_iface.h file not found from static analysis is almost certainly an include-path artifact, not a code issue.

src/uct/gaudi/base/gaudi_base.h (1)

1-20: Gaudi base header API looks consistent with the implementation

Function prototypes and types line up with gaudi_base.c and existing UCT/UCT topo APIs; nothing blocking here from a header/API perspective.

src/uct/gaudi/Makefile.am (1)

6-38: GAUDI module Makefile wiring looks correct

Conditional build, sources, headers, and linkage flags are wired the same way as other UCT transports; no functional issues spotted here.

src/uct/gaudi/base/scal.h (1)

1-31: SCAL header is self-contained and matches current usage

Opaque handle definitions, scal_memory_pool_infoV2, and function prototypes line up with how they’re used in gaudi_base.c; nothing blocking from this header as written.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (1)

35-36: Remove unsupported registration capability advertisement.

Lines 35-36 advertise reg_mem_types for HOST and GAUDI memory, but line 143 shows .mem_reg returns UCS_ERR_UNSUPPORTED. This inconsistency will cause UCP to believe registration is available and fail at runtime.

Apply this diff:

-    attr->reg_mem_types    = UCS_BIT(UCS_MEMORY_TYPE_HOST) |
-	                     UCS_BIT(UCS_MEMORY_TYPE_GAUDI);
+    attr->reg_mem_types    = 0;
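
The mismatch described above (a non-empty registration mask paired with an unsupported registration op) is the kind of invariant that could be checked mechanically. The sketch below uses hypothetical stand-in types (`md_caps_t`, `mem_reg_fn_t`, `MEM_TYPE_BIT`), not the UCT structures, and models "unsupported" as a NULL function pointer purely for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_TYPE_BIT(i) (UINT64_C(1) << (i))

/* Hypothetical stand-ins for an MD's advertised attributes and ops table */
typedef int (*mem_reg_fn_t)(void *addr, size_t length);

typedef struct {
    uint64_t     reg_mem_types; /* bitmask of memory types that can be registered */
    mem_reg_fn_t mem_reg;       /* NULL models "registration unsupported" */
} md_caps_t;

/* Returns 1 when the advertised mask agrees with the ops table, 0 otherwise */
int md_caps_consistent(const md_caps_t *caps)
{
    assert(caps != NULL);

    if (caps->mem_reg == NULL) {
        /* No registration op: nothing may be advertised as registrable */
        return caps->reg_mem_types == 0;
    }

    /* A registration op exists: at least one memory type should be advertised */
    return caps->reg_mem_types != 0;
}
```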
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 0600a8e and 94a58ed.

📒 Files selected for processing (2)
  • src/ucs/sys/topo/gaudi/topo.c (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/ucs/sys/topo/gaudi/topo.c (7)
src/ucs/sys/topo/base/topo.c (6)
  • ucs_topo_get_device_bus_id (355-364)
  • ucs_topo_num_devices (211-220)
  • ucs_topo_sys_device_get_name (784-801)
  • ucs_topo_find_device_by_bdf_name (737-759)
  • ucs_topo_sys_device_set_name (761-782)
  • ucs_topo_sys_device_get_numa_node (803-820)
src/ucs/sys/string.c (5)
  • ucs_strncpy_safe (235-249)
  • ucs_snprintf_safe (226-233)
  • ucs_string_alloc_path_buffer (454-465)
  • ucs_path_get_common_parent (373-381)
  • ucs_path_calc_distance (383-389)
src/ucs/debug/memtrack.c (3)
  • ucs_free (368-372)
  • ucs_calloc (336-342)
  • ucs_malloc (328-334)
src/ucs/sys/sys.c (3)
  • ucs_sys_read_sysfs_file (1634-1654)
  • ucs_make_affinity_str (1312-1344)
  • ucs_sys_pthread_getaffinity (1376-1384)
src/ucs/memory/numa.c (4)
  • ucs_numa_num_configured_cpus (110-122)
  • ucs_numa_node_of_cpu (124-152)
  • ucs_numa_num_configured_nodes (95-108)
  • ucs_numa_distance (232-254)
src/ucs/type/spinlock.h (3)
  • ucs_spin_lock (81-84)
  • ucs_spin_unlock (127-130)
  • ucs_spinlock_init (43-59)
src/ucs/datastruct/list.h (1)
  • ucs_list_del (106-110)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (5)
src/uct/base/uct_md.c (3)
  • uct_md_base_md_query (505-523)
  • uct_md_query_single_md_resource (144-162)
  • uct_md_stub_rkey_unpack (173-181)
src/ucs/debug/memtrack.c (2)
  • ucs_free (368-372)
  • ucs_malloc (328-334)
src/ucs/memory/memtype_cache.c (1)
  • ucs_memtype_cache_update (266-277)
src/ucs/sys/stubs.c (3)
  • ucs_empty_function_return_unsupported (54-57)
  • ucs_empty_function_return_success (49-52)
  • ucs_empty_function (15-17)
src/uct/gaudi/base/gaudi_base.c (3)
  • uct_gaudi_base_get_fd (23-31)
  • uct_gaudi_base_get_info (51-95)
  • uct_gaudi_base_get_sysdev (33-49)
🪛 Clang (14.0.6)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c

[error] 10-10: 'ucs/memory/memtype_cache.h' file not found

(clang-diagnostic-error)

🔇 Additional comments (8)
src/ucs/sys/topo/gaudi/topo.c (5)

1718-1722: Spinlock initialization issue from past review has been addressed.

The past review concern about uninitialized spinlock has been properly fixed. The pthread_once mechanism now ensures the spinlock is initialized before any ucs_spin_lock calls in both ucs_gaudi_topo_init and ucs_gaudi_lazy_init.

Also applies to: 1741-1747, 1773-1779
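
For readers unfamiliar with the pattern, a minimal sketch of `pthread_once`-guarded lock initialization follows. It uses plain POSIX spinlocks on Linux rather than UCX's `ucs_spinlock_t`, and all names (`topo_lock`, `topo_counter_add`, and so on) are hypothetical.

```c
#define _XOPEN_SOURCE 700 /* exposes pthread_spinlock_t under strict -std= modes */
#include <assert.h>
#include <pthread.h>

static pthread_once_t     topo_lock_once = PTHREAD_ONCE_INIT;
static pthread_spinlock_t topo_lock;
static int                topo_lock_init_calls; /* written only by the once routine */
static long               topo_counter;         /* protected by topo_lock */

/* Runs exactly once, no matter which entry point gets there first */
static void topo_lock_init_once(void)
{
    int ret = pthread_spin_init(&topo_lock, PTHREAD_PROCESS_PRIVATE);
    assert(ret == 0);
    (void)ret;
    topo_lock_init_calls++;
}

/* Every entry point goes through pthread_once before touching the lock */
long topo_counter_add(long delta)
{
    long value;

    pthread_once(&topo_lock_once, topo_lock_init_once);
    pthread_spin_lock(&topo_lock);
    topo_counter += delta;
    value         = topo_counter;
    pthread_spin_unlock(&topo_lock);
    return value;
}

/* Reports how many times the init routine actually ran */
int topo_lock_init_count(void)
{
    pthread_once(&topo_lock_once, topo_lock_init_once);
    return topo_lock_init_calls;
}
```

Because both public functions call `pthread_once` first, it does not matter whether an explicit init path or a lazy path runs first; the lock is initialized before its first use either way.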


121-158: Helper function correctly manages realpath allocation.

The realpath(path, NULL) call allocates memory that is properly freed at line 155. Error handling is correct.
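
The allocation contract mentioned here is easy to get wrong, so a small sketch may help. `resolve_path` is a hypothetical helper, not the function in topo.c; it only shows that the buffer returned by `realpath(path, NULL)` belongs to the caller and must be released with `free()` on every path.

```c
#define _XOPEN_SOURCE 700 /* exposes realpath() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: copies the canonical form of 'path' into 'buf'.
 * Returns 0 on success, -1 if the path cannot be resolved or does not fit.
 */
int resolve_path(const char *path, char *buf, size_t buf_len)
{
    char *resolved;
    size_t len;
    int ret = -1;

    assert((path != NULL) && (buf != NULL));

    resolved = realpath(path, NULL); /* buffer is malloc()ed by libc */
    if (resolved == NULL) {
        return -1; /* nothing was allocated, nothing to free */
    }

    len = strlen(resolved);
    if (len < buf_len) {
        memcpy(buf, resolved, len + 1); /* include the terminating NUL */
        ret = 0;
    }

    free(resolved); /* released on both the success and the too-small path */
    return ret;
}
```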


627-721: Distance estimation logic is sound.

The NUMA-aware distance estimation with PCIe hop counting is well-structured. The fallback handling for undefined NUMA nodes and the incremental latency/bandwidth degradation per hop are reasonable heuristics.
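
As a rough illustration of hop counting over sysfs paths, the sketch below counts how many path components separate two absolute paths from their deepest common ancestor. It is an assumption-laden stand-in: `path_hop_distance` is hypothetical, it is not the PR's estimator nor UCX's `ucs_path_calc_distance`, and it does not model the latency or bandwidth penalties the review mentions.

```c
#include <assert.h>
#include <stddef.h>

/* Counts non-empty '/'-separated components in s */
static size_t count_components(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if ((s[0] == '/') && (s[1] != '/') && (s[1] != '\0')) {
            n++;
        }
    }
    return n;
}

/*
 * Hypothetical hop metric: number of components below the deepest common
 * ancestor directory, summed over both absolute paths.
 */
size_t path_hop_distance(const char *a, const char *b)
{
    size_t i = 0, common = 0;

    assert((a[0] == '/') && (b[0] == '/'));

    /* Track the last '/' inside the shared prefix */
    while ((a[i] != '\0') && (a[i] == b[i])) {
        if (a[i] == '/') {
            common = i;
        }
        i++;
    }

    /* If the shared prefix ends on a component boundary, keep all of it */
    if (((a[i] == '\0') || (a[i] == '/')) &&
        ((b[i] == '\0') || (b[i] == '/'))) {
        common = i;
    }

    return count_components(a + common) + count_components(b + common);
}
```

Two devices under the same PCIe bridge are two hops apart under this metric (one step up from each device to the shared bridge), while a device and its own parent bridge are one hop apart.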


1179-1513: Balanced assignment algorithm is well-designed.

The two-pass NUMA-aware assignment with epsilon-based tie-breaking is sophisticated and correct. Memory management is proper with all temporary allocations freed. The soft-cap preference in Pass 2 elegantly handles NUMA nodes without local NICs.
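
Epsilon-based tie-breaking can be shown in isolation. The following is a deliberately simplified single-pass sketch, not the PR's two-pass algorithm: among NICs whose cost is within `eps` of the best cost, it prefers the one with the lowest current load. `pick_nic_balanced`, `cost`, and `load` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical selector: returns the index of the least-loaded NIC among
 * those whose cost is within 'eps' of the minimum cost, or -1 if there
 * are no NICs. Earlier indices win when loads are equal.
 */
int pick_nic_balanced(const double *cost, const unsigned *load,
                      size_t num_nics, double eps)
{
    double min_cost;
    int best = -1;
    size_t i;

    if (num_nics == 0) {
        return -1;
    }

    assert((cost != NULL) && (load != NULL));

    /* First find the best achievable cost */
    min_cost = cost[0];
    for (i = 1; i < num_nics; i++) {
        if (cost[i] < min_cost) {
            min_cost = cost[i];
        }
    }

    /* Then treat everything within eps of it as a tie, broken by load */
    for (i = 0; i < num_nics; i++) {
        if ((cost[i] - min_cost) > eps) {
            continue;
        }
        if ((best < 0) || (load[i] < load[best])) {
            best = (int)i;
        }
    }

    return best;
}
```

A caller would increment `load[best]` after each assignment so that later devices spread across near-equivalent NICs instead of piling onto the first one.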


1866-1911: Cleanup function properly manages resources and locks.

The cleanup sequence correctly:

  1. Guards against double-cleanup with provider_added check
  2. Takes locks in proper order (mutex then spinlock)
  3. Removes provider from global list
  4. Frees all allocations
  5. Resets state
  6. Releases locks
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (3)

42-52: Device FD is now properly closed on teardown.

The past review issue has been fixed. Lines 48-50 now correctly close md->fd before freeing the MD structure, preventing FD leaks.


54-74: Memory attribute query correctly validates address range.

The range check at lines 63-66 ensures only addresses within the Gaudi device memory are accepted. The helper correctly populates memory info without duplicating the FD (duplication happens in the caller).
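
A range check of this kind is worth writing so that it cannot overflow. The sketch below is a generic version under assumed names (`addr_in_device_range`, `base`, `size`); the PR's actual check at lines 63-66 is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: returns 1 if [addr, addr + length) lies entirely
 * inside the device range [base, base + size), 0 otherwise. Written with
 * subtractions only, so no intermediate sum can wrap.
 */
int addr_in_device_range(uint64_t addr, size_t length,
                         uint64_t base, uint64_t size)
{
    uint64_t offset;

    /* Precondition: the device range itself must not wrap */
    assert(size <= (UINT64_MAX - base));

    if (addr < base) {
        return 0;
    }

    offset = addr - base;
    if (offset >= size) {
        return 0; /* start address is past the end of the range */
    }

    return length <= (size - offset);
}
```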


110-110: FD duplication is correct for dmabuf field.

Line 110 properly duplicates the dmabuf FD so the caller owns an independent file descriptor. This follows UCX conventions for FD passing.
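
The ownership rule can be sketched with plain POSIX `dup()`. The struct and function names below (`mem_info_t`, `mem_info_set_dmabuf`) are hypothetical; the point is only that the caller receives its own descriptor and may close it without affecting the one the MD keeps.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the memory-info structure handed to the caller */
typedef struct {
    int dmabuf_fd; /* owned by the caller once filled in; -1 if unset */
} mem_info_t;

/*
 * Gives the caller an independent descriptor for the MD's dmabuf.
 * Returns 0 on success, -1 if duplication fails (errno is set by dup()).
 */
int mem_info_set_dmabuf(mem_info_t *info, int md_dmabuf_fd)
{
    int fd;

    assert(info != NULL);

    fd = dup(md_dmabuf_fd);
    if (fd < 0) {
        info->dmabuf_fd = -1;
        return -1;
    }

    info->dmabuf_fd = fd; /* caller is now responsible for close(fd) */
    return 0;
}
```

Where leaking descriptors across `exec` is a concern, `fcntl(fd, F_DUPFD_CLOEXEC, 0)` is a common alternative to `dup()`.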

zzhang37 pushed a commit to intel-staging/ucx that referenced this pull request Nov 21, 2025
*UCT/GAUDI: Free the fds when called md_close
zzhang37 pushed a commit to intel-staging/ucx that referenced this pull request Nov 21, 2025
*UCT/GAUDI: Clean unused configuration
zzhang37 pushed a commit to intel-staging/ucx that referenced this pull request Nov 21, 2025
*CONFIG/M4: Clean config/m4 for gaudi
zzhang37 pushed a commit to intel-staging/ucx that referenced this pull request Nov 21, 2025
*UCT/GAUDI: Remove unused reg_mem type

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/ucs/sys/topo/gaudi/topo.c (1)

347-524: Tighten Gaudi enumeration and connection-matrix edge cases to avoid “ghost” devices and spurious ENOMEM

In ucs_gaudi_enumerate_devices() and the downstream connection/assignment logic there are a couple of robustness gaps:

  1. num_gaudi_devices is not reduced to the number actually mapped into UCX topo

    • num_gaudi_devices is first set from the count of accel* directories (Lines 375–389).
    • During the populate pass (Lines 444–480), some entries can be skipped (e.g. failed PCI addr read or ucs_topo_find_device_by_bdf_name() failure), leaving those slots zero‑initialized. However, ucs_gaudi_topo_ctx.num_gaudi_devices remains the original directory count.
    • Later code (e.g. ucs_gaudi_create_connection_matrix() and ucs_gaudi_build_assignment_balanced()) iterates up to num_gaudi_devices and will treat those trailing zero entries as real Gaudi devices, building connections and assignments for “ghost” devices that can never be addressed by name.

    A contained improvement is to clamp the count after the populate loop:

        }
    -
    +    /* Only keep Gaudi devices that were successfully mapped into UCX topo */
    +    ucs_gaudi_topo_ctx.num_gaudi_devices = gaudi_idx;
  2. ucs_gaudi_create_connection_matrix() treats max_num_connections == 0 as ENOMEM

    When num_gaudi_devices or num_hnic_devices is 0, max_num_connections becomes 0 (Lines 733–734), ucs_calloc(0, ...) can legitimately return NULL, and the function translates this into UCS_ERR_NO_MEMORY (Lines 735–740). That’s misleading: “no devices” is not an allocation failure.

    You can avoid this by short‑circuiting the zero case:

        max_num_connections = ucs_gaudi_topo_ctx.num_gaudi_devices *
                              ucs_gaudi_topo_ctx.num_hnic_devices;
    +    if (max_num_connections == 0) {
    +        ucs_gaudi_topo_ctx.num_connections = 0;
    +        return UCS_OK;
    +    }
    +    ucs_gaudi_topo_ctx.connections = ucs_calloc(max_num_connections,
    +                                                sizeof(ucs_gaudi_connection_t),
    +                                                "gaudi_connections");
    +    if (!ucs_gaudi_topo_ctx.connections) {
    +        return UCS_ERR_NO_MEMORY;
    +    }

    This plays nicely with ucs_gaudi_build_assignment_balanced(), which already short‑circuits the “no devices to assign” case (Lines 1247–1252).

These changes are not strictly required for correctness in the happy path, but they make behaviour more predictable when sysfs/topology is incomplete or partially broken and avoid surprising “virtual” Gaudis and fake ENOMEM.

Also applies to: 720-780, 1155-1510
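
Both suggestions above can be illustrated together in a standalone sketch. The types and functions here (`topo_ctx_t`, `topo_compact_devices`, `topo_alloc_connections`) are hypothetical and use plain `calloc` instead of `ucs_calloc`; the first keeps only successfully mapped devices and returns the clamped count, the second treats "no devices" as success rather than as an allocation failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the topology context */
typedef struct {
    int gaudi_idx;
    int hnic_idx;
} conn_t;

typedef struct {
    size_t  num_gaudi;
    size_t  num_hnic;
    conn_t *connections;
    size_t  num_connections;
} topo_ctx_t;

/*
 * Compacts 'ids' in place, keeping entries whose 'mapped' flag is set.
 * The return value is the count the context should store, so that later
 * loops never visit zero-initialized "ghost" slots.
 */
size_t topo_compact_devices(int *ids, const int *mapped, size_t count)
{
    size_t kept = 0, i;

    for (i = 0; i < count; i++) {
        if (mapped[i]) {
            ids[kept++] = ids[i];
        }
    }
    return kept;
}

/* Returns 0 on success (including "nothing to connect"), -1 on real ENOMEM */
int topo_alloc_connections(topo_ctx_t *ctx)
{
    size_t max_conns;

    assert(ctx != NULL);

    max_conns            = ctx->num_gaudi * ctx->num_hnic;
    ctx->connections     = NULL;
    ctx->num_connections = 0;

    if (max_conns == 0) {
        return 0; /* calloc(0, ...) may return NULL; that is not a failure */
    }

    ctx->connections = calloc(max_conns, sizeof(*ctx->connections));
    return (ctx->connections != NULL) ? 0 : -1;
}
```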

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 94a58ed and 7ad5290.

📒 Files selected for processing (7)
  • config/m4/gaudi.m4 (1 hunks)
  • src/ucs/sys/topo/gaudi/topo.c (1 hunks)
  • src/uct/gaudi/base/gaudi_base.c (1 hunks)
  • src/uct/gaudi/base/gaudi_base.h (1 hunks)
  • src/uct/gaudi/configure.m4 (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • config/m4/gaudi.m4
🧰 Additional context used
🧬 Code graph analysis (4)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (5)
src/uct/base/uct_md.c (3)
  • uct_md_base_md_query (505-523)
  • uct_md_query_single_md_resource (144-162)
  • uct_md_stub_rkey_unpack (173-181)
src/uct/gaudi/base/gaudi_base.c (5)
  • uct_gaudi_base_close_dmabuf_fd (48-52)
  • uct_gaudi_base_close_fd (42-46)
  • uct_gaudi_base_get_fd (23-40)
  • uct_gaudi_base_get_info (72-116)
  • uct_gaudi_base_get_sysdev (54-70)
src/ucs/debug/memtrack.c (2)
  • ucs_free (368-372)
  • ucs_malloc (328-334)
src/ucs/memory/memtype_cache.c (1)
  • ucs_memtype_cache_update (266-277)
src/ucs/sys/stubs.c (3)
  • ucs_empty_function_return_unsupported (54-57)
  • ucs_empty_function_return_success (49-52)
  • ucs_empty_function (15-17)
src/ucs/sys/topo/gaudi/topo.c (6)
src/ucs/sys/topo/base/topo.c (6)
  • ucs_topo_get_device_bus_id (355-364)
  • ucs_topo_num_devices (211-220)
  • ucs_topo_sys_device_get_name (784-801)
  • ucs_topo_find_device_by_bdf_name (737-759)
  • ucs_topo_sys_device_set_name (761-782)
  • ucs_topo_sys_device_get_numa_node (803-820)
src/ucs/sys/string.c (3)
  • ucs_strncpy_safe (235-249)
  • ucs_snprintf_safe (226-233)
  • ucs_string_alloc_path_buffer (454-465)
src/ucs/debug/memtrack.c (3)
  • ucs_free (368-372)
  • ucs_calloc (336-342)
  • ucs_malloc (328-334)
src/ucs/sys/sys.c (3)
  • ucs_sys_read_sysfs_file (1634-1654)
  • ucs_make_affinity_str (1312-1344)
  • ucs_sys_pthread_getaffinity (1376-1384)
src/ucs/memory/numa.c (4)
  • ucs_numa_num_configured_cpus (110-122)
  • ucs_numa_node_of_cpu (124-152)
  • ucs_numa_num_configured_nodes (95-108)
  • ucs_numa_distance (232-254)
src/ucs/type/spinlock.h (3)
  • ucs_spin_lock (81-84)
  • ucs_spin_unlock (127-130)
  • ucs_spinlock_init (43-59)
src/uct/gaudi/base/gaudi_base.h (1)
src/uct/gaudi/base/gaudi_base.c (6)
  • uct_gaudi_base_get_fd (23-40)
  • uct_gaudi_base_close_fd (42-46)
  • uct_gaudi_base_close_dmabuf_fd (48-52)
  • uct_gaudi_base_get_sysdev (54-70)
  • uct_gaudi_base_get_info (72-116)
  • uct_gaudi_base_query_devices (118-133)
src/uct/gaudi/base/gaudi_base.c (3)
bindings/go/src/ucx/ucs_constants.go (2)
  • UCS_ERR_IO_ERROR (62-62)
  • UCS_ERR_INVALID_ADDR (66-66)
src/ucs/sys/topo/base/topo.c (1)
  • ucs_topo_find_device_by_bdf_name (737-759)
src/uct/base/uct_iface.c (1)
  • uct_single_device_resource (550-571)
🪛 Clang (14.0.6)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c

[error] 10-10: 'ucs/memory/memtype_cache.h' file not found

(clang-diagnostic-error)

src/uct/gaudi/base/gaudi_base.h

[error] 9-9: 'stdbool.h' file not found

(clang-diagnostic-error)

src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.h

[error] 9-9: 'stdbool.h' file not found

(clang-diagnostic-error)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (1)
src/ucs/sys/topo/gaudi/topo.c (1)

40-44: Spinlock initialization and init/cleanup ordering now look correct

The revised initialization and teardown flow for the Gaudi topo context is solid:

  • gaudi_init_mutex + provider_added gate provider registration and cleanup.
  • ucs_gaudi_spinlock_once_init() with pthread_once (Lines 1715–1719, 1738–1744, 1770–1776) guarantees ucs_gaudi_topo_ctx.lock is initialized exactly once, regardless of whether ucs_gaudi_topo_init() or ucs_gaudi_lazy_init() runs first.
  • ucs_gaudi_topo_cleanup() only takes the spinlock if provider_added is set (Lines 1868–1875), matching the invariant that successful provider registration implies a successfully initialized spinlock.
  • Lock ordering is consistent (gaudi_init_mutex → ucs_gaudi_topo_ctx.lock), so there is no obvious deadlock cycle with lazy init.

This addresses the earlier uninitialized-spinlock UB and gives a clean story for concurrent init/lazy-init/cleanup.

Also applies to: 1715-1777, 1863-1908
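
The init/cleanup pattern described above can be sketched in isolation. This is a minimal illustration, not the PR's code: it assumes a POSIX spinlock standing in for the UCS spinlock, and every `demo_` name is hypothetical. It shows one-time lock initialization via `pthread_once`, a mutex-guarded `provider_added` flag, and a fixed lock order (mutex first, spinlock second).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for pthread_spinlock_t */
#endif

#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the Gaudi topo context; not the PR's struct. */
typedef struct {
    pthread_spinlock_t lock;           /* inner lock */
    int                lock_ready;     /* set once the spinlock is initialized */
    int                provider_added; /* guarded by demo_init_mutex */
    int                num_devices;    /* guarded by lock */
} demo_topo_ctx_t;

static demo_topo_ctx_t demo_ctx;
static pthread_mutex_t demo_init_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t  demo_lock_once  = PTHREAD_ONCE_INIT;

/* Runs exactly once, no matter which entry point gets there first. */
static void demo_lock_init_once(void)
{
    demo_ctx.lock_ready =
            (pthread_spin_init(&demo_ctx.lock, PTHREAD_PROCESS_PRIVATE) == 0);
}

/* Returns 0 on success, -1 if the spinlock could not be initialized. */
int demo_topo_init(void)
{
    pthread_once(&demo_lock_once, demo_lock_init_once);
    if (!demo_ctx.lock_ready) {
        return -1;
    }

    pthread_mutex_lock(&demo_init_mutex);  /* outer lock first */
    if (!demo_ctx.provider_added) {
        pthread_spin_lock(&demo_ctx.lock); /* inner lock second */
        demo_ctx.num_devices = 0;
        pthread_spin_unlock(&demo_ctx.lock);
        demo_ctx.provider_added = 1;
    }
    pthread_mutex_unlock(&demo_init_mutex);
    return 0;
}

void demo_topo_cleanup(void)
{
    pthread_mutex_lock(&demo_init_mutex);
    if (demo_ctx.provider_added) {
        /* Invariant: a registered provider implies an initialized spinlock. */
        assert(demo_ctx.lock_ready);
        pthread_spin_lock(&demo_ctx.lock);
        demo_ctx.num_devices = 0;
        pthread_spin_unlock(&demo_ctx.lock);
        demo_ctx.provider_added = 0;
    }
    pthread_mutex_unlock(&demo_init_mutex);
}
```

Calling `demo_topo_cleanup()` without a prior successful `demo_topo_init()` never touches the spinlock, which is the property the comment above relies on.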

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (2)

54-60: Prefer integer (or byte-pointer) arithmetic for address range and offset computations

The range check currently compares addr with begin/end as void *, which is technically undefined for unrelated pointers and can upset some analyzers. You can make this more robust and explicit by doing the math in uintptr_t (or uint8_t *) and reusing that representation when computing dmabuf_offset:

-    void *begin = (void *)gaudi_md->device_base_address;
-    void *end = (uint8_t *)begin + gaudi_md->totalSize;
-
-    if ((addr < begin) || (addr >= end)) {
+    uintptr_t begin = gaudi_md->device_base_address;
+    uintptr_t end   = begin + gaudi_md->totalSize;
+    uintptr_t a     = (uintptr_t)addr;
+
+    if ((a < begin) || (a >= end)) {
         mem_info->type = UCS_MEMORY_TYPE_LAST;
         return UCS_ERR_OUT_OF_RANGE;
     }
@@
-    if (mem_attr_p->field_mask & UCT_MD_MEM_ATTR_FIELD_DMABUF_OFFSET) {
-        mem_attr_p->dmabuf_offset = UCS_PTR_BYTE_DIFF(mem_info.base_address,
-                                                      addr);
-    }
+    if (mem_attr_p->field_mask & UCT_MD_MEM_ATTR_FIELD_DMABUF_OFFSET) {
+        mem_attr_p->dmabuf_offset =
+                (ptrdiff_t)((uintptr_t)addr -
+                            (uintptr_t)mem_info.base_address);
+    }

This keeps the behavior the same while making the intent clearer and avoiding undefined pointer relational operations.

Also applies to: 111-114


49-50: Either use length for bounds or mark it explicitly unused

length is accepted by uct_gaudi_md_query_attributes, uct_gaudi_md_mem_query, and uct_gaudi_md_detect_memory_type but never used. That’s fine from a semantics standpoint, but with -Wextra it may generate unused-parameter warnings.

Two options:

  • If you expect callers to pass meaningful ranges, extend the bounds check to ensure [addr, addr + length) stays within the device region.
  • If range length is intentionally ignored, mark it unused to keep builds quiet, e.g.:
uct_gaudi_md_query_attributes(uct_md_h md, const void *addr, size_t length,
                              ucs_memory_info_t *mem_info, int *dmabuf_fd)
{
    (void)length; /* intentionally unused */
    ...
}

(Or use whatever ucs_unused-style macro UCX prefers.)

Also applies to: 70-73, 119-121
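
If the first option is chosen, the range check and the offset computation can share one integer-based helper. The following is a sketch under assumptions: the helper name and signature are hypothetical, and the device region is described by a base address and a size as in the snippet above. Comparing offsets rather than end addresses avoids overflow when `addr + length` would wrap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 if [addr, addr + length) lies inside [base, base + size),
 * 0 otherwise. On success, *offset_p receives the byte offset of addr
 * from base. All arithmetic is done on integers, never on void pointers.
 */
int demo_range_contains(uintptr_t base, size_t size, const void *addr,
                        size_t length, size_t *offset_p)
{
    uintptr_t a = (uintptr_t)addr;
    size_t offset;

    assert(offset_p != NULL);

    /* Start address must fall inside the region. */
    if ((a < base) || ((a - base) >= size)) {
        return 0;
    }

    /* Remaining bytes after the offset must cover the requested length. */
    offset = (size_t)(a - base);
    if (length > (size - offset)) {
        return 0;
    }

    *offset_p = offset;
    return 1;
}
```

The returned offset is the same quantity the review computes for `dmabuf_offset`, so one call could serve both the bounds check and the attribute query.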

src/ucs/sys/topo/gaudi/topo.c (4)

80-95: Mirroring internal topo provider structs is ABI-fragile.

compatible_topo_ops_t / compatible_topo_provider_t mirror the current layout of ucs_sys_topo_ops_t / ucs_sys_topo_provider_t from topo.c and are then inserted into ucs_sys_topo_providers_list. This works only as long as the internal struct layout (field order, types, padding) does not change; any future change in topo.c could silently break this provider.

If possible, consider:

  • Moving the real ucs_sys_topo_provider_t definition into a shared header, or
  • Adding a registration API that hides the provider struct layout and just takes function pointers and a name.

That would make the Gaudi provider robust against internal topology changes.
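
The second option might look roughly like the sketch below. It is purely illustrative: the callback signature, the fixed-size table, and all `demo_` names are assumptions, and the real UCX topo ops have different signatures. The point is only that the provider struct stays private to one file and callers pass a name and function pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type; not the real UCX topo ops signature. */
typedef int (*demo_topo_distance_fn_t)(int dev1, int dev2, double *latency_ns);

#define DEMO_MAX_PROVIDERS 8

/* Layout is private to this file; callers never see or mirror it. */
typedef struct {
    const char              *name;
    demo_topo_distance_fn_t get_distance;
} demo_topo_provider_t;

static demo_topo_provider_t demo_providers[DEMO_MAX_PROVIDERS];
static size_t               demo_num_providers;

/* Returns 0 on success, -1 on bad arguments, duplicates, or a full table. */
int demo_topo_register_provider(const char *name,
                                demo_topo_distance_fn_t get_distance)
{
    size_t i;

    if ((name == NULL) || (get_distance == NULL)) {
        return -1;
    }

    for (i = 0; i < demo_num_providers; ++i) {
        if (strcmp(demo_providers[i].name, name) == 0) {
            return -1;
        }
    }

    if (demo_num_providers == DEMO_MAX_PROVIDERS) {
        return -1;
    }

    demo_providers[demo_num_providers].name         = name;
    demo_providers[demo_num_providers].get_distance = get_distance;
    ++demo_num_providers;
    return 0;
}

/* Looks up a provider by name and forwards the query; -1 if not found. */
int demo_topo_query(const char *name, int dev1, int dev2, double *latency_ns)
{
    size_t i;

    assert((name != NULL) && (latency_ns != NULL));

    for (i = 0; i < demo_num_providers; ++i) {
        if (strcmp(demo_providers[i].name, name) == 0) {
            return demo_providers[i].get_distance(dev1, dev2, latency_ns);
        }
    }

    return -1;
}

/* Example provider callback returning a made-up constant latency. */
int demo_gaudi_distance(int dev1, int dev2, double *latency_ns)
{
    *latency_ns = (dev1 == dev2) ? 0.0 : 300.0;
    return 0;
}
```

With such an API, a layout change inside the registry would require no change in the provider, since the provider never copies the struct definition.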


266-304: Consider adding a range check when parsing module_id to avoid silent truncation.

ucs_gaudi_read_module_id() parses the module ID into an unsigned long and then casts to uint32_t without checking for overflow, unlike ucs_gaudi_read_vendor_id() which guards against val > UINT16_MAX. While sysfs is unlikely to expose an out-of-range module_id, adding a symmetric check would prevent silent truncation and keep behavior consistent with vendor parsing.

A minimal adjustment:

     errno = 0;
     val   = strtoul(buffer, &endptr, 10);
@@
     if (errno != 0 || endptr == buffer || *endptr != '\0') {
         ucs_debug("Invalid module ID in %s: '%s'", path, buffer);
         return UCS_ERR_INVALID_PARAM;
     }
+    if (val > UINT32_MAX) {
+        ucs_debug("Module ID %lu exceeds uint32_t in %s", val, path);
+        return UCS_ERR_INVALID_PARAM;
+    }
 
     *module_id = (uint32_t)val;
     return UCS_OK;

(Assuming UINT32_MAX is already available via existing headers; otherwise, include it in the project-preferred way.)
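
As a standalone illustration of the same check, the sketch below parses a decimal string into a `uint32_t` with explicit overflow handling. The function name and the 0/-1 return convention are hypothetical, not the PR's. It is also slightly stricter than the diff above: it rejects a leading sign or whitespace, because `strtoul()` accepts `-1` and wraps it to a large value.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 0 on success, -1 on malformed or out-of-range input. */
int demo_parse_module_id(const char *text, uint32_t *module_id)
{
    unsigned long val;
    char *endptr;

    assert((text != NULL) && (module_id != NULL));

    /* Require a leading digit: no whitespace, no '+' or '-' sign. */
    if (!isdigit((unsigned char)text[0])) {
        return -1;
    }

    errno = 0;
    val   = strtoul(text, &endptr, 10);
    if ((errno != 0) || (endptr == text) || (*endptr != '\0')) {
        return -1;
    }

#if ULONG_MAX > UINT32_MAX
    /* Only meaningful where unsigned long is wider than 32 bits. */
    if (val > UINT32_MAX) {
        return -1;
    }
#endif

    *module_id = (uint32_t)val;
    return 0;
}
```

On platforms where `unsigned long` is 32 bits wide, out-of-range input is already caught through `errno == ERANGE`, so the preprocessor guard only avoids an always-false comparison.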


351-400: Clarify semantics for “no Gaudi / no HNIC” vs. genuine initialization failures.

ucs_gaudi_enumerate_devices() returns UCS_ERR_NO_DEVICE when there are no Gaudi devices (lines 395–400) or no Mellanox/Broadcom NICs (lines 419–424). ucs_gaudi_lazy_init() (lines 1762–1796) then propagates this as an error, which in turn causes:

  • ucs_gaudi_get_distance() to log an error and return UCS_ERR_NO_DEVICE, and
  • ucs_gaudi_find_best_connection() to treat it as a hard failure (it only special-cases UCS_ERR_UNSUPPORTED as “provider disabled”).

On nodes where “no Gaudi / no RoCE NIC” is expected (common in mixed clusters), this may be more naturally treated as “provider not applicable” rather than an error, depending on how the global topo framework reacts to non-OK provider statuses.

I’d suggest double-checking the intended behavior:

  • If “no Gaudi present” should not be an error for generic topology queries, consider mapping UCS_ERR_NO_DEVICE from ucs_gaudi_enumerate_devices() to UCS_ERR_UNSUPPORTED inside ucs_gaudi_lazy_init(), and then having ucs_gaudi_find_best_connection() continue to return UCS_ERR_NO_ELEM in that case.
  • If you do want “no Gaudi / no NIC” to be a hard error for Gaudi-specific code but benign for generic topo, ensure the caller side (e.g., the topo provider dispatcher) explicitly treats UCS_ERR_NO_DEVICE from this provider as ignorable.

Also applies to: 419-424, 1762-1796
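
The first option amounts to a small status translation at two layers. The sketch below uses hypothetical status codes that only borrow the names from the discussion above; their values and the two helper functions are assumptions, not UCX definitions.

```c
#include <assert.h>

/* Hypothetical status codes; names follow the review, values do not. */
typedef enum {
    DEMO_OK              =  0,
    DEMO_ERR_NO_DEVICE   = -1,
    DEMO_ERR_UNSUPPORTED = -2,
    DEMO_ERR_NO_ELEM     = -3,
    DEMO_ERR_IO_ERROR    = -4
} demo_status_t;

/* Lazy-init layer: "no devices found" becomes "provider not applicable". */
demo_status_t demo_lazy_init_status(demo_status_t enumerate_status)
{
    assert(enumerate_status <= DEMO_OK);

    return (enumerate_status == DEMO_ERR_NO_DEVICE) ? DEMO_ERR_UNSUPPORTED :
                                                      enumerate_status;
}

/* Query layer: a disabled provider yields "no element"; real errors pass. */
demo_status_t demo_find_best_connection(demo_status_t enumerate_status)
{
    demo_status_t status = demo_lazy_init_status(enumerate_status);

    if (status == DEMO_ERR_UNSUPPORTED) {
        return DEMO_ERR_NO_ELEM;
    }

    return status;
}
```

Under this mapping, a node without Gaudi devices produces the same outcome as a disabled provider, while a genuine failure such as an I/O error still propagates unchanged.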


1606-1620: Sysfs vendor-ID lookups on every distance/memory query may be unnecessarily expensive.

Both ucs_gaudi_get_distance() (vendor checks at lines 1606–1620) and ucs_gaudi_get_memory_distance() (lines 1667–1671) call ucs_gaudi_read_vendor_id() on each invocation. That routine does ucs_gaudi_sys_dev_to_sysfs_path() plus fopen/fgets on /sys, which can be relatively expensive if these hooks are hit frequently in hot paths.

Given you already have:

  • ucs_gaudi_topo_ctx.gaudi_devices[] and ucs_gaudi_topo_ctx.hnic_devices[], and
  • ucs_gaudi_topo_ctx.hnic_vendor_ids[],

you could avoid repeated sysfs I/O by:

  • Caching vendor IDs per ucs_sys_device_t in the Gaudi context (e.g., a small array or map), or
  • First checking whether device is known to be a Gaudi or HNIC via the existing arrays, and only falling back to ucs_gaudi_read_vendor_id() when not found.

This is not correctness-critical, but could reduce overhead if these functions are queried often.

Also applies to: 1667-1671
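
A per-device cache along the lines of the first option could be as small as the sketch below. It makes several assumptions: the system device handle is an 8-bit index, vendor ID 0 is never valid and can mark an empty slot, and the slow reader is passed in as a callback. All names are hypothetical, the vendor IDs in the fake reader are illustrative only, and the cache is not thread-safe on its own (in the PR it would presumably live under the context lock).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SYS_DEVICES 256
#define DEMO_VENDOR_UNKNOWN  0 /* assumed never to be a valid vendor ID */

/* Hypothetical slow-path reader, e.g. one that opens a sysfs file. */
typedef int (*demo_vendor_reader_t)(uint8_t sys_dev, uint16_t *vendor_id);

/* One slot per possible 8-bit device index, zero-initialized (unknown). */
static uint16_t demo_vendor_cache[DEMO_MAX_SYS_DEVICES];

/* Returns 0 on success; falls back to slow_read only on a cache miss. */
int demo_get_vendor_id(uint8_t sys_dev, demo_vendor_reader_t slow_read,
                       uint16_t *vendor_id)
{
    int ret;

    assert((slow_read != NULL) && (vendor_id != NULL));

    if (demo_vendor_cache[sys_dev] != DEMO_VENDOR_UNKNOWN) {
        *vendor_id = demo_vendor_cache[sys_dev];
        return 0;
    }

    ret = slow_read(sys_dev, vendor_id);
    if (ret == 0) {
        demo_vendor_cache[sys_dev] = *vendor_id;
    }

    return ret;
}

/* Fake reader for demonstration: counts calls instead of touching sysfs. */
int demo_fake_read_calls;

int demo_fake_reader(uint8_t sys_dev, uint16_t *vendor_id)
{
    ++demo_fake_read_calls;
    *vendor_id = (sys_dev == 3) ? 0x1da3 : 0x15b3;
    return 0;
}
```

Failed reads are deliberately not cached here, so a device that becomes readable later is retried on the next query.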

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ad5290 and 8a2e69e.

📒 Files selected for processing (3)
  • src/ucs/sys/topo/gaudi/topo.c (1 hunks)
  • src/ucs/sys/topo/gaudi/topo.h (1 hunks)
  • src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/ucs/sys/topo/gaudi/topo.c (7)
src/ucs/sys/topo/base/topo.c (6)
  • ucs_topo_get_device_bus_id (355-364)
  • ucs_topo_num_devices (211-220)
  • ucs_topo_sys_device_get_name (784-801)
  • ucs_topo_find_device_by_bdf_name (737-759)
  • ucs_topo_sys_device_set_name (761-782)
  • ucs_topo_sys_device_get_numa_node (803-820)
src/ucs/sys/string.c (5)
  • ucs_strncpy_safe (235-249)
  • ucs_snprintf_safe (226-233)
  • ucs_string_alloc_path_buffer (454-465)
  • ucs_path_get_common_parent (373-381)
  • ucs_path_calc_distance (383-389)
src/ucs/debug/memtrack.c (3)
  • ucs_free (368-372)
  • ucs_calloc (336-342)
  • ucs_malloc (328-334)
src/ucs/sys/sys.c (3)
  • ucs_sys_read_sysfs_file (1634-1654)
  • ucs_make_affinity_str (1312-1344)
  • ucs_sys_pthread_getaffinity (1376-1384)
src/ucs/memory/numa.c (4)
  • ucs_numa_num_configured_cpus (110-122)
  • ucs_numa_node_of_cpu (124-152)
  • ucs_numa_num_configured_nodes (95-108)
  • ucs_numa_distance (232-254)
src/ucs/type/spinlock.h (3)
  • ucs_spin_lock (81-84)
  • ucs_spin_unlock (127-130)
  • ucs_spinlock_init (43-59)
src/ucs/datastruct/list.h (1)
  • ucs_list_del (106-110)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c (4)
src/uct/base/uct_md.c (3)
  • uct_md_base_md_query (505-523)
  • uct_md_query_single_md_resource (144-162)
  • uct_md_stub_rkey_unpack (173-181)
src/uct/gaudi/base/gaudi_base.c (5)
  • uct_gaudi_base_close_dmabuf_fd (48-52)
  • uct_gaudi_base_close_fd (42-46)
  • uct_gaudi_base_get_fd (23-40)
  • uct_gaudi_base_get_info (72-116)
  • uct_gaudi_base_get_sysdev (54-70)
src/ucs/memory/memtype_cache.c (1)
  • ucs_memtype_cache_update (266-277)
src/ucs/sys/stubs.c (3)
  • ucs_empty_function_return_unsupported (54-57)
  • ucs_empty_function_return_success (49-52)
  • ucs_empty_function (15-17)
src/ucs/sys/topo/gaudi/topo.h (1)
src/ucs/sys/topo/gaudi/topo.c (4)
  • ucs_gaudi_topo_init (1725-1760)
  • ucs_gaudi_topo_cleanup (1867-1911)
  • ucs_gaudi_get_index_from_module_id (307-348)
  • ucs_gaudi_find_best_connection (1516-1567)
🪛 Clang (14.0.6)
src/uct/gaudi/gaudi_gdr/gaudi_gdr_md.c

[error] 10-10: 'ucs/memory/memtype_cache.h' file not found

(clang-diagnostic-error)

src/ucs/sys/topo/gaudi/topo.h

[error] 10-10: 'ucs/sys/topo/base/topo.h' file not found

(clang-diagnostic-error)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: UCX PR (Codestyle commit title)
  • GitHub Check: UCX PR (Codestyle AUTHORS file update check)
  • GitHub Check: UCX PR (Codestyle format code)
  • GitHub Check: UCX PR (Codestyle ctags check)
  • GitHub Check: UCX PR (Codestyle codespell check)
  • GitHub Check: UCX release DRP (Prepare CheckRelease)
  • GitHub Check: UCX release (Prepare CheckRelease)
  • GitHub Check: UCX snapshot (Prepare Check)
🔇 Additional comments (1)
src/ucs/sys/topo/gaudi/topo.h (1)

15-60: Gaudi topology public API and docs match the implementation.

The prototypes and documented semantics (init/cleanup pairing, -1 sentinel for ucs_gaudi_get_index_from_module_id(), and detailed status codes for ucs_gaudi_find_best_connection()) align with the definitions and behavior in topo.c. Nothing blocking here from an API or documentation standpoint.

@zzhang37 zzhang37 force-pushed the intel_gaudi_gdr_enabling_0 branch 6 times, most recently from ee9a8df to b108672 Compare November 25, 2025 17:13
…on all the comments

UCS: Added Intel gaudi topolody for gaudi devices
@zzhang37 zzhang37 force-pushed the intel_gaudi_gdr_enabling_0 branch from b108672 to b0bc79f Compare November 25, 2025 18:07
@zzhang37
Author

The failure seems unrelated to the code in this PR. Can anyone help resolve the issue?

Contributor

@yosefe yosefe left a comment

initial review

/* check if ROCM KFD driver is loaded */
uct_ib_check_gpudirect_driver(md, "/dev/kfd", UCS_MEMORY_TYPE_ROCM);

#ifdef HAVE_GAUDI

let's remove this ifdef


SUBDIRS = .

module_LTLIBRARIES = libuct_gaudi.la

minor: align = on column

esoha-nvidia <[email protected]>
lyu <[email protected]>
lzhang2 <[email protected]>
nileshnegi <[email protected]>

it seems like unrelated changes are added to the AUTHORS file

Comment on lines +1 to +9
/**
* Copyright (C) Intel Corporation, 2025. ALL RIGHTS RESERVED.
*
* See file LICENSE for terms.
*/

#ifdef HAVE_CONFIG_H
#include "config.h"
#endif

the code in this file seems gaudi-specific, can we move it to uct/gaudi (to the transport)?

#include <hlthunk.h>
#include <synapse_api.h>

int uct_gaudi_base_get_fd(int device_id, bool *fd_created) {

pls check clang-format output (here, { should be on next line)


#include <stdint.h>

#define SCAL_SUCCESS 0

use UCT_GAUDI_ prefix for identifiers in gaudi transport

* Otherwise, the device is opened via hlthunk_open_by_module_id function.
*/
rc = scal_init(fd, "", &scal_handle, NULL);
}

pls use spaces between blocks

ucs_status_t uct_gaudi_base_get_sysdev(int fd, ucs_sys_device_t* sys_dev) {
ucs_status_t status;
char pci_bus_id[13];
int rc = hlthunk_get_pci_bus_id_from_fd(fd, pci_bus_id, sizeof(pci_bus_id));

pls don't use c99 style declarations (even if technically it compiles)

Suggested change
int rc = hlthunk_get_pci_bus_id_from_fd(fd, pci_bus_id, sizeof(pci_bus_id));
int rc;
rc = hlthunk_get_pci_bus_id_from_fd(fd, pci_bus_id, sizeof(pci_bus_id));

#include <uct/base/uct_md.h>
#include "scal.h"

int uct_gaudi_base_get_fd(int device_id, bool *fd_created);

pls add space lines between functions

rc = scal_init(fd, "", &scal_handle, NULL);
}
if (rc != SCAL_SUCCESS) {
ucs_error("Failed to get scal handle");

pls use lowercase for debug/error prints, and add information where possible, such as device name, error code meaning/string, etc
