Conversation

@mawad-amd (Collaborator) commented Jan 19, 2026

Motivation

This PR adds a virtual memory-based allocator to replace the simple torch.empty-based heap allocation in Iris. The implementation uses HIP's vmem APIs to provide better control over GPU memory management and enables DMA-BUF IPC for multi-process memory sharing.

Technical Details

Test Plan

Test Result

Submission Checklist

@mawad-amd mawad-amd requested a review from neoblizz as a code owner January 19, 2026 03:53
Copilot AI review requested due to automatic review settings January 19, 2026 03:53
@mawad-amd mawad-amd requested a review from BKP as a code owner January 19, 2026 03:53
@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Jan 19, 2026
Copilot AI (Contributor) left a comment

Pull request overview


Changes:

  • Introduces C++/HIP virtual memory allocator with pybind11 bindings
  • Replaces the simple heap allocation with vmem-based allocation supporting free-list reuse
  • Adds DMA-BUF file descriptor passing infrastructure for multi-process GPU memory sharing
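The free-list reuse mentioned in the changes above can be sketched in a few lines. The following is a minimal Python illustration of first-fit reuse; the function name and granularity are assumptions for illustration, not taken from the PR's C++ allocator:

```python
def allocate_from_free_list(free_list, size, granularity=4096):
    """First-fit reuse: take the first free block large enough, splitting
    off the remainder. Returns a virtual address, or None if no block fits
    (signalling that a fresh physical mapping is needed)."""
    # Round the request up to the mapping granularity, as vmem APIs require.
    aligned = (size + granularity - 1) // granularity * granularity
    for i, (va, block_size) in enumerate(free_list):
        if block_size >= aligned:
            if block_size > aligned:
                # Keep the unused tail of the block on the free list.
                free_list[i] = (va + aligned, block_size - aligned)
            else:
                free_list.pop(i)
            return va
    return None
```

On a miss (None), the real allocator would create and map new physical memory at the end of the heap instead.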

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 20 comments.

Summary per file:

  • setup.py: Adds CMake build infrastructure for the C++ extension
  • pyproject.toml: Adds CMake as a build dependency
  • iris/iris.py: Integrates the vmem allocator, replaces memory_pool with vmem_allocators, adds DMA-BUF sharing
  • iris/hip.py: Adds DMA-BUF export/import functions (currently unused)
  • iris/gpu_array_wrapper.py: New wrapper class exposing __cuda_array_interface__ for vmem allocations
  • iris/fd_passing.py: New Unix domain socket utilities for file descriptor passing via SCM_RIGHTS
  • csrc/src/bindings.cpp: pybind11 bindings for the vmem allocator
  • csrc/include/vmem_allocator.hpp: Core C++ virtual memory allocator implementation
  • csrc/include/gpu_array.hpp: GPU array interface utilities (appears unused)
  • cmake/CPM.cmake: CMake package manager for dependencies
  • .gitignore: Adds build artifacts and socket files to the ignore list
  • .devcontainer/devcontainer.json: Changes remote user to root, removes the features section

return {
'version': 3,
'shape': self._shape,
'typestr': self._dtype_map[self._dtype_str],
Copilot AI commented Jan 19, 2026

If _dtype_str is not in _dtype_map, accessing __cuda_array_interface__ will raise a KeyError with no helpful context. Add validation in __init__ or provide a more informative error message when an unsupported dtype is encountered.

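One way to surface the failure eagerly, sketched as a small helper (the helper name is hypothetical, not part of the PR):

```python
def resolve_typestr(dtype_map, dtype_str):
    # Fail with a descriptive error instead of a bare KeyError deep inside
    # __cuda_array_interface__.
    try:
        return dtype_map[dtype_str]
    except KeyError:
        supported = ", ".join(sorted(dtype_map))
        raise TypeError(
            f"Unsupported dtype {dtype_str!r}; supported dtypes: {supported}"
        ) from None
```

Calling this once in __init__ would make an unsupported dtype fail at construction time, where the stack trace points at the caller's allocation.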
iris/iris.py Outdated
Comment on lines 354 to 364
if peer_offset != offset or peer_size != int(size):
self.warning(
f"Offset/size mismatch for peer {peer}: "
f"local(offset={offset}, size={int(size)}) vs "
f"peer(offset={peer_offset}, size={peer_size})"
)

if self.heap_offset + aligned_size > self.heap_size:
raise MemoryError("Heap out of memory")
# Import peer allocation into our VA at the *peer's* base+offset (remote address model).
peer_base = int(self.heap_bases[peer].item())
target_va = peer_base + peer_offset
self.vmem_allocators[peer].import_dmabuf_at(int(target_va), int(peer_fd), int(peer_size))
Copilot AI commented Jan 19, 2026

The offset/size mismatch check logs a warning but continues execution. This could lead to silent corruption in multi-rank scenarios where allocators diverge. Consider making this a critical error that terminates execution, or add a mechanism to synchronize and recover from allocation order mismatches.

Comment on lines 405 to 418
// FD is no longer needed after import (avoid leaks).
(void)::close(dmabuf_fd);

hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);

hipMemAccessDesc access_desc;
access_desc.location = alloc_prop_.location;
access_desc.flags = hipMemAccessFlagsProtReadWrite;
hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));

track_allocation(target_va, aligned_size, handle, true);
log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
return target_va;
Copilot AI commented Jan 19, 2026

The file descriptor is closed after a successful import. However, if any of the subsequent HIP calls (hipMemMap or hipMemSetAccess) fail, the file descriptor will already be closed but the function will throw an exception, potentially leaking the imported handle. Consider using RAII or deferring the close until all operations succeed.

Suggested change: replace
// FD is no longer needed after import (avoid leaks).
(void)::close(dmabuf_fd);
hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);
hipMemAccessDesc access_desc;
access_desc.location = alloc_prop_.location;
access_desc.flags = hipMemAccessFlagsProtReadWrite;
hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));
track_allocation(target_va, aligned_size, handle, true);
log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
return target_va;

with:

// Ensure handle and FD are properly cleaned up on failure.
try {
hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);
hipMemAccessDesc access_desc;
access_desc.location = alloc_prop_.location;
access_desc.flags = hipMemAccessFlagsProtReadWrite;
hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));
// All HIP operations succeeded; FD is no longer needed after import.
(void)::close(dmabuf_fd);
track_allocation(target_va, aligned_size, handle, true);
log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
return target_va;
} catch (...) {
// Best-effort cleanup of imported handle on error.
hipError_t release_err = hipMemRelease(handle);
if (release_err != hipSuccess) {
log_error("Failed to release imported handle {} after error: {}", (void*)handle,
static_cast<int>(release_err));
}
// Ensure the FD is closed even if subsequent HIP calls failed.
(void)::close(dmabuf_fd);
throw;
}

Comment on lines 44 to 57

@property
def __cuda_array_interface__(self):
"""CUDA Array Interface v3"""
return {
'version': 3,
'shape': self._shape,
'typestr': self._dtype_map[self._dtype_str],
'data': (self._data_ptr, False), # (ptr, readonly)
'strides': None, # C-contiguous
'descr': [('', self._dtype_map[self._dtype_str])],
}

@property
Copilot AI commented Jan 19, 2026

The __cuda_array_interface__ property performs dictionary lookups on every access. Since this property may be accessed frequently by PyTorch/NumPy interop code, consider caching the interface dictionary in __init__ to avoid repeated dictionary construction and lookups.

Suggested change: replace the property

    @property
    def __cuda_array_interface__(self):
        """CUDA Array Interface v3"""
        return {
            'version': 3,
            'shape': self._shape,
            'typestr': self._dtype_map[self._dtype_str],
            'data': (self._data_ptr, False),  # (ptr, readonly)
            'strides': None,  # C-contiguous
            'descr': [('', self._dtype_map[self._dtype_str])],
        }

with a dictionary built once in __init__:

    # Cache the typestr and full CUDA Array Interface dictionary to avoid
    # repeated dictionary construction and lookups on every property access.
    self._typestr = self._dtype_map[self._dtype_str]
    self._cuda_array_interface = {
        'version': 3,
        'shape': self._shape,
        'typestr': self._typestr,
        'data': (self._data_ptr, False),  # (ptr, readonly)
        'strides': None,  # C-contiguous
        'descr': [('', self._typestr)],
    }

and a property that returns it:

    @property
    def __cuda_array_interface__(self):
        """CUDA Array Interface v3"""
        return self._cuda_array_interface

if hasattr(torch, "bfloat16"):
dtype_map[torch.bfloat16] = ("bfloat16", torch.bfloat16)

dtype_str, post_view = dtype_map.get(dtype, ("float32", None))
Copilot AI commented Jan 19, 2026

If an unsupported dtype is passed, the code silently defaults to "float32" instead of raising an error. This could lead to silent data corruption or unexpected behavior. Consider raising a ValueError for unsupported dtypes instead of silently falling back to a default.

Suggested change: replace

    dtype_str, post_view = dtype_map.get(dtype, ("float32", None))

with:

    if dtype not in dtype_map:
        raise ValueError(f"Unsupported dtype {dtype!r} for Iris allocation")
    dtype_str, post_view = dtype_map[dtype]

iris/iris.py Outdated
heap_base = int(self.vmem_allocator.base())
self.info(f"Initialized vmem allocator with VA base: {hex(heap_base)}")
except ImportError as e:
raise RuntimeError(f"Failed to import vmem allocator: {e}. Please build the C++ extension.")
Copilot AI commented Jan 19, 2026

The error message suggests building the C++ extension, but the build will fail because CMakeLists.txt is missing from the PR. This creates a confusing user experience where following the error message's instruction won't help. The error message should be updated once CMakeLists.txt is added, or provide more specific guidance about the build requirements.

Suggested change: replace

    raise RuntimeError(f"Failed to import vmem allocator: {e}. Please build the C++ extension.")

with:

    raise RuntimeError(
        "Failed to import the Iris vmem allocator C++ extension 'iris._iris_vmem': "
        f"{e}. This usually means the native extension has not been built or is not "
        "installed in your current environment.\n\n"
        "If you are developing Iris from source, ensure you have built and installed "
        "the project with the C++ extensions enabled (for example, from the project "
        'root run: `pip install -e ".[dev]"`). This requires a valid ROCm/HIP toolchain '
        "and the native build configuration files (such as CMake-based build scripts) "
        "to be present. For detailed and up-to-date build instructions, please refer "
        "to the Iris documentation."
    )

# Accept connections from higher ranks
for _ in range(rank + 1, world_size):
client, _ = listener.accept()
peer_rank = int.from_bytes(client.recv(4), "little", signed=False)
Copilot AI commented Jan 19, 2026

The recv call with a fixed 4-byte buffer could block indefinitely or fail if the peer doesn't send exactly 4 bytes. Consider adding a timeout to prevent hanging, or validate the received data length before attempting to parse it.

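A short-read-safe, timeout-bounded receive could look like this (a sketch with a hypothetical helper name, not code from the PR):

```python
import socket

def recv_exact(sock, n, timeout=30.0):
    """Read exactly n bytes, raising on timeout or peer shutdown rather
    than silently handing back a short read."""
    sock.settimeout(timeout)  # raises socket.timeout if the peer stalls
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError(f"peer closed after {len(buf)}/{n} bytes")
        buf.extend(chunk)
    return bytes(buf)

# The rank handshake would then become:
#   peer_rank = int.from_bytes(recv_exact(client, 4), "little", signed=False)
```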
void add_to_free_list(void* va, std::size_t size) {
log_debug("Adding {} bytes at {} to free list", size, va);
free_list_.push_back({va, size});
// TODO: Could merge adjacent free blocks here for better space utilization
Copilot AI commented Jan 19, 2026

The TODO comment indicates that adjacent free blocks are not being merged, which can lead to memory fragmentation. This could prevent successful allocations even when sufficient total free space exists. Implement coalescing of adjacent free blocks to improve memory utilization.

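The coalescing the TODO describes amounts to sorting blocks by address and merging neighbours whose ranges touch. In Python terms (a sketch of the idea only, with integer addresses standing in for void*; note that in the real allocator merging VA ranges is only half the story, since the underlying physical handles remain separate):

```python
def coalesce(free_list):
    # Merge blocks whose end address equals the next block's start address.
    merged = []
    for va, size in sorted(free_list):
        if merged and merged[-1][0] + merged[-1][1] == va:
            prev_va, prev_size = merged[-1]
            merged[-1] = (prev_va, prev_size + size)
        else:
            merged.append((va, size))
    return merged
```

Running this after each add_to_free_list (or lazily, when a first-fit search fails) would let large requests be satisfied from what were previously several small adjacent frees.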
Comment on lines 259 to 289
def export_dmabuf_handle(ptr, size):
"""
Export a DMA-BUF file descriptor for a memory range (vmem allocations).
Args:
ptr: Integer or ctypes pointer to GPU memory
size: Size of the memory range in bytes
Returns:
File descriptor (integer) for the DMA-BUF handle
Raises:
RuntimeError: If export fails or backend doesn't support it
"""
if not _is_amd_backend:
raise RuntimeError("DMA-BUF export only supported on AMD/HIP backend")

fd = ctypes.c_int(-1)
ptr_arg = ctypes.c_void_p(ptr) if isinstance(ptr, int) else ptr

# hipMemRangeHandleTypeDmaBufFd = 1
err = gpu_runtime.hipMemGetHandleForAddressRange(
ctypes.byref(fd), ptr_arg, size, 1, 0
)

if err != 0:
gpu_try(err) # Will raise with error message

return fd.value


Copilot AI commented Jan 19, 2026

The export_dmabuf_handle function is defined in hip.py but is not used anywhere in the codebase. The actual DMA-BUF export is done through the vmem_allocator's export_dmabuf method. This creates confusion about which API should be used and adds dead code. Consider removing this function or documenting why it exists separately from the vmem_allocator.

Suggested change (remove the unused function):
def export_dmabuf_handle(ptr, size):
"""
Export a DMA-BUF file descriptor for a memory range (vmem allocations).
Args:
ptr: Integer or ctypes pointer to GPU memory
size: Size of the memory range in bytes
Returns:
File descriptor (integer) for the DMA-BUF handle
Raises:
RuntimeError: If export fails or backend doesn't support it
"""
if not _is_amd_backend:
raise RuntimeError("DMA-BUF export only supported on AMD/HIP backend")
fd = ctypes.c_int(-1)
ptr_arg = ctypes.c_void_p(ptr) if isinstance(ptr, int) else ptr
# hipMemRangeHandleTypeDmaBufFd = 1
err = gpu_runtime.hipMemGetHandleForAddressRange(
ctypes.byref(fd), ptr_arg, size, 1, 0
)
if err != 0:
gpu_try(err) # Will raise with error message
return fd.value

path = all_paths[rank]
try:
os.unlink(path)
except FileNotFoundError:
Copilot AI commented Jan 19, 2026

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change: add an explanatory comment

    except FileNotFoundError:
        # The socket path may not exist on first use or after successful cleanup; ignore.

@mawad-amd (Collaborator, Author) commented:

Needs ROCm/rocm-systems#2667 resolved first.
