Add virtual-memory based allocator #319
base: main
Conversation
Pull request overview
This PR adds a virtual memory-based allocator to replace the simple torch.empty-based heap allocation in Iris. The implementation uses HIP's vmem APIs to provide better control over GPU memory management and enables DMA-BUF IPC for multi-process memory sharing.
Changes:
- Introduces C++/HIP virtual memory allocator with pybind11 bindings
- Replaces simple heap allocation with sophisticated vmem-based allocation supporting free list reuse
- Adds DMA-BUF file descriptor passing infrastructure for multi-process GPU memory sharing (see the SCM_RIGHTS sketch after this list)
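
For context on that last item, passing file descriptors between processes over Unix domain sockets via SCM_RIGHTS generally follows the pattern below. This is a minimal sketch using the standard `socket` module; `send_fd` and `recv_fd` are illustrative helpers, not the actual API of iris/fd_passing.py:

```python
import array
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one file descriptor over a connected Unix domain socket."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data must accompany the ancillary data.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds.tobytes())])

def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for cmsg_level, cmsg_type, cmsg_data in ancdata:
        if cmsg_level == socket.SOL_SOCKET and cmsg_type == socket.SCM_RIGHTS:
            fds.frombytes(cmsg_data[: fds.itemsize])
    if not fds:
        raise RuntimeError("No file descriptor received")
    return fds[0]
```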
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| setup.py | Adds CMake build infrastructure for C++ extension |
| pyproject.toml | Adds CMake as build dependency |
| iris/iris.py | Integrates vmem allocator, replaces memory_pool with vmem_allocators, adds DMA-BUF sharing |
| iris/hip.py | Adds DMA-BUF export/import functions (currently unused) |
| iris/gpu_array_wrapper.py | New wrapper class exposing cuda_array_interface for vmem allocations |
| iris/fd_passing.py | New Unix domain socket utilities for file descriptor passing via SCM_RIGHTS |
| csrc/src/bindings.cpp | pybind11 bindings for vmem allocator |
| csrc/include/vmem_allocator.hpp | Core C++ virtual memory allocator implementation |
| csrc/include/gpu_array.hpp | GPU array interface utilities (appears unused) |
| cmake/CPM.cmake | CMake package manager for dependencies |
| .gitignore | Adds build artifacts and socket files to ignore list |
| .devcontainer/devcontainer.json | Changes remote user to root, removes features section |
iris/gpu_array_wrapper.py
Outdated
```python
return {
    'version': 3,
    'shape': self._shape,
    'typestr': self._dtype_map[self._dtype_str],
```
Copilot AI · Jan 19, 2026:
If `_dtype_str` is not in `_dtype_map`, accessing `__cuda_array_interface__` will raise a `KeyError` with no helpful context. Add validation in `__init__` or provide a more informative error message when an unsupported dtype is encountered.
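
A minimal sketch of the suggested validation, assuming a hypothetical `__init__` that takes the pointer, shape, and dtype string and that `_dtype_map` is a class attribute (the signature here is illustrative, not the PR's actual one):

```python
def __init__(self, data_ptr, shape, dtype_str):
    # Validate the dtype up front so unsupported types fail loudly here
    # rather than with a bare KeyError inside __cuda_array_interface__.
    if dtype_str not in self._dtype_map:
        raise ValueError(
            f"Unsupported dtype {dtype_str!r}; supported dtypes: "
            f"{sorted(self._dtype_map)}"
        )
    self._data_ptr = data_ptr
    self._shape = shape
    self._dtype_str = dtype_str
```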
iris/iris.py
Outdated
```python
if peer_offset != offset or peer_size != int(size):
    self.warning(
        f"Offset/size mismatch for peer {peer}: "
        f"local(offset={offset}, size={int(size)}) vs "
        f"peer(offset={peer_offset}, size={peer_size})"
    )

if self.heap_offset + aligned_size > self.heap_size:
    raise MemoryError("Heap out of memory")
# Import peer allocation into our VA at the *peer's* base+offset (remote address model).
peer_base = int(self.heap_bases[peer].item())
target_va = peer_base + peer_offset
self.vmem_allocators[peer].import_dmabuf_at(int(target_va), int(peer_fd), int(peer_size))
```
Copilot AI · Jan 19, 2026:
The offset/size mismatch check logs a warning but continues execution. This could lead to silent corruption in multi-rank scenarios where allocators diverge. Consider making this a critical error that terminates execution, or adding a mechanism to synchronize and recover from allocation-order mismatches.
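
A sketch of the fail-fast variant, reusing the names from the diff above (`peer`, `peer_offset`, `peer_size`); whether to raise or attempt recovery is a design decision for the authors:

```python
if peer_offset != offset or peer_size != int(size):
    # Diverged allocators would silently corrupt remote accesses,
    # so fail fast instead of warning and continuing.
    raise RuntimeError(
        f"Offset/size mismatch for peer {peer}: "
        f"local(offset={offset}, size={int(size)}) vs "
        f"peer(offset={peer_offset}, size={peer_size}); "
        "allocation order has diverged across ranks"
    )
```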
```cpp
// FD is no longer needed after import (avoid leaks).
(void)::close(dmabuf_fd);

hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);

hipMemAccessDesc access_desc;
access_desc.location = alloc_prop_.location;
access_desc.flags = hipMemAccessFlagsProtReadWrite;
hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));

track_allocation(target_va, aligned_size, handle, true);
log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
return target_va;
```
Copilot AI · Jan 19, 2026:
The file descriptor is closed after a successful import. However, if any of the subsequent HIP calls (hipMemMap or hipMemSetAccess) fail, the file descriptor will already be closed but the function will throw an exception, potentially leaking the imported handle. Consider using RAII or deferring the close until all operations succeed.
Suggested change:
```diff
-// FD is no longer needed after import (avoid leaks).
-(void)::close(dmabuf_fd);
-hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
-log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);
-hipMemAccessDesc access_desc;
-access_desc.location = alloc_prop_.location;
-access_desc.flags = hipMemAccessFlagsProtReadWrite;
-hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));
-track_allocation(target_va, aligned_size, handle, true);
-log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
-return target_va;
+// Ensure handle and FD are properly cleaned up on failure.
+try {
+    hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
+    log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);
+    hipMemAccessDesc access_desc;
+    access_desc.location = alloc_prop_.location;
+    access_desc.flags = hipMemAccessFlagsProtReadWrite;
+    hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));
+    // All HIP operations succeeded; FD is no longer needed after import.
+    (void)::close(dmabuf_fd);
+    track_allocation(target_va, aligned_size, handle, true);
+    log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
+    return target_va;
+} catch (...) {
+    // Best-effort cleanup of imported handle on error.
+    hipError_t release_err = hipMemRelease(handle);
+    if (release_err != hipSuccess) {
+        log_error("Failed to release imported handle {} after error: {}", (void*)handle,
+                  static_cast<int>(release_err));
+    }
+    // Ensure the FD is closed even if subsequent HIP calls failed.
+    (void)::close(dmabuf_fd);
+    throw;
+}
```
iris/gpu_array_wrapper.py
Outdated
```python
@property
def __cuda_array_interface__(self):
    """CUDA Array Interface v3"""
    return {
        'version': 3,
        'shape': self._shape,
        'typestr': self._dtype_map[self._dtype_str],
        'data': (self._data_ptr, False),  # (ptr, readonly)
        'strides': None,  # C-contiguous
        'descr': [('', self._dtype_map[self._dtype_str])],
    }

@property
```
Copilot AI · Jan 19, 2026:
The `__cuda_array_interface__` property performs dictionary lookups on every access. Since this property may be accessed frequently by PyTorch/NumPy interop code, consider caching the interface dictionary in `__init__` to avoid repeated dictionary construction and lookups.
Suggested change:
```diff
-@property
-def __cuda_array_interface__(self):
-    """CUDA Array Interface v3"""
-    return {
-        'version': 3,
-        'shape': self._shape,
-        'typestr': self._dtype_map[self._dtype_str],
-        'data': (self._data_ptr, False),  # (ptr, readonly)
-        'strides': None,  # C-contiguous
-        'descr': [('', self._dtype_map[self._dtype_str])],
-    }
+# Cache the typestr and full CUDA Array Interface dictionary (in __init__) to avoid
+# repeated dictionary construction and lookups on every property access.
+self._typestr = self._dtype_map[self._dtype_str]
+self._cuda_array_interface = {
+    'version': 3,
+    'shape': self._shape,
+    'typestr': self._typestr,
+    'data': (self._data_ptr, False),  # (ptr, readonly)
+    'strides': None,  # C-contiguous
+    'descr': [('', self._typestr)],
+}
+
+@property
+def __cuda_array_interface__(self):
+    """CUDA Array Interface v3"""
+    return self._cuda_array_interface
 @property
```
```python
if hasattr(torch, "bfloat16"):
    dtype_map[torch.bfloat16] = ("bfloat16", torch.bfloat16)

dtype_str, post_view = dtype_map.get(dtype, ("float32", None))
```
Copilot AI · Jan 19, 2026:
If an unsupported dtype is passed, the code silently defaults to "float32" instead of raising an error. This could lead to silent data corruption or unexpected behavior. Consider raising a ValueError for unsupported dtypes instead of silently falling back to a default.
Suggested change:
```diff
-dtype_str, post_view = dtype_map.get(dtype, ("float32", None))
+if dtype not in dtype_map:
+    raise ValueError(f"Unsupported dtype {dtype!r} for Iris allocation")
+dtype_str, post_view = dtype_map[dtype]
```
iris/iris.py
Outdated
```python
    heap_base = int(self.vmem_allocator.base())
    self.info(f"Initialized vmem allocator with VA base: {hex(heap_base)}")
except ImportError as e:
    raise RuntimeError(f"Failed to import vmem allocator: {e}. Please build the C++ extension.")
```
Copilot AI · Jan 19, 2026:
The error message suggests building the C++ extension, but the build will fail because CMakeLists.txt is missing from the PR. This creates a confusing user experience where following the error message's instruction won't help. The error message should be updated once CMakeLists.txt is added, or it should provide more specific guidance about the build requirements.
Suggested change:
```diff
-raise RuntimeError(f"Failed to import vmem allocator: {e}. Please build the C++ extension.")
+raise RuntimeError(
+    "Failed to import the Iris vmem allocator C++ extension 'iris._iris_vmem': "
+    f"{e}. This usually means the native extension has not been built or is not "
+    "installed in your current environment.\n\n"
+    "If you are developing Iris from source, ensure you have built and installed "
+    'the project with the C++ extensions enabled (for example, from the project '
+    'root run: `pip install -e ".[dev]"`). This requires a valid ROCm/HIP toolchain '
+    "and the native build configuration files (such as CMake-based build scripts) "
+    "to be present. For detailed and up-to-date build instructions, please refer "
+    "to the Iris documentation."
+)
```
```python
# Accept connections from higher ranks
for _ in range(rank + 1, world_size):
    client, _ = listener.accept()
    peer_rank = int.from_bytes(client.recv(4), "little", signed=False)
```
Copilot AI · Jan 19, 2026:
The recv call with a fixed 4-byte buffer could block indefinitely or fail if the peer doesn't send exactly 4 bytes. Consider adding a timeout to prevent hanging, or validate the received data length before attempting to parse it.
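
One possible hardening, sketched with the standard `socket` module; `recv_exact` is a hypothetical helper, not part of this PR:

```python
import socket

def recv_exact(sock: socket.socket, n: int, timeout: float = 10.0) -> bytes:
    """Receive exactly n bytes or raise, instead of trusting a single recv()."""
    sock.settimeout(timeout)  # avoid blocking forever on a stuck peer
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError(f"Peer closed connection after {len(buf)}/{n} bytes")
        buf += chunk
    return buf

# At the accept site:
# peer_rank = int.from_bytes(recv_exact(client, 4), "little", signed=False)
```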
csrc/include/vmem_allocator.hpp
Outdated
```cpp
void add_to_free_list(void* va, std::size_t size) {
    log_debug("Adding {} bytes at {} to free list", size, va);
    free_list_.push_back({va, size});
    // TODO: Could merge adjacent free blocks here for better space utilization
```
Copilot AI · Jan 19, 2026:
The TODO comment indicates that adjacent free blocks are not being merged, which can lead to memory fragmentation. This could prevent successful allocations even when sufficient total free space exists. Implement coalescing of adjacent free blocks to improve memory utilization.
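
For illustration, the coalescing logic could look like the following Python sketch, modeling `free_list_` as a list of `(addr, size)` tuples; the actual implementation would live in the C++ allocator:

```python
def add_to_free_list(free_list, va: int, size: int) -> None:
    """Insert a freed block and merge it with adjacent free blocks."""
    free_list.append((va, size))
    free_list.sort()  # order by address so neighbors sit next to each other
    merged = [free_list[0]]
    for addr, sz in free_list[1:]:
        prev_addr, prev_sz = merged[-1]
        if prev_addr + prev_sz == addr:
            # Adjacent blocks: extend the previous entry instead of keeping two.
            merged[-1] = (prev_addr, prev_sz + sz)
        else:
            merged.append((addr, sz))
    free_list[:] = merged
```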
```python
def export_dmabuf_handle(ptr, size):
    """
    Export a DMA-BUF file descriptor for a memory range (vmem allocations).

    Args:
        ptr: Integer or ctypes pointer to GPU memory
        size: Size of the memory range in bytes

    Returns:
        File descriptor (integer) for the DMA-BUF handle

    Raises:
        RuntimeError: If export fails or backend doesn't support it
    """
    if not _is_amd_backend:
        raise RuntimeError("DMA-BUF export only supported on AMD/HIP backend")

    fd = ctypes.c_int(-1)
    ptr_arg = ctypes.c_void_p(ptr) if isinstance(ptr, int) else ptr

    # hipMemRangeHandleTypeDmaBufFd = 1
    err = gpu_runtime.hipMemGetHandleForAddressRange(
        ctypes.byref(fd), ptr_arg, size, 1, 0
    )

    if err != 0:
        gpu_try(err)  # Will raise with error message

    return fd.value
```
Copilot AI · Jan 19, 2026:
The export_dmabuf_handle function is defined in hip.py but is not used anywhere in the codebase. The actual DMA-BUF export is done through the vmem_allocator's export_dmabuf method. This creates confusion about which API should be used and adds dead code. Consider removing this function or documenting why it exists separately from the vmem_allocator.
Suggested change (delete the function):
```diff
-def export_dmabuf_handle(ptr, size):
-    """
-    Export a DMA-BUF file descriptor for a memory range (vmem allocations).
-    Args:
-        ptr: Integer or ctypes pointer to GPU memory
-        size: Size of the memory range in bytes
-    Returns:
-        File descriptor (integer) for the DMA-BUF handle
-    Raises:
-        RuntimeError: If export fails or backend doesn't support it
-    """
-    if not _is_amd_backend:
-        raise RuntimeError("DMA-BUF export only supported on AMD/HIP backend")
-    fd = ctypes.c_int(-1)
-    ptr_arg = ctypes.c_void_p(ptr) if isinstance(ptr, int) else ptr
-    # hipMemRangeHandleTypeDmaBufFd = 1
-    err = gpu_runtime.hipMemGetHandleForAddressRange(
-        ctypes.byref(fd), ptr_arg, size, 1, 0
-    )
-    if err != 0:
-        gpu_try(err)  # Will raise with error message
-    return fd.value
```
```python
path = all_paths[rank]
try:
    os.unlink(path)
except FileNotFoundError:
```
Copilot AI · Jan 19, 2026:
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
```diff
-except FileNotFoundError:
+except FileNotFoundError:
+    # The socket path may not exist on first use or after successful cleanup; ignore.
```
|
Needs ROCm/rocm-systems#2667 resolved first. |
Motivation
This PR adds a virtual memory-based allocator to replace the simple torch.empty-based heap allocation in Iris. The implementation uses HIP's vmem APIs to provide better control over GPU memory management and enables DMA-BUF IPC for multi-process memory sharing.
Technical Details
Test Plan
Test Result
Submission Checklist