Add virtual-memory based allocator #319
base: main
Conversation
Pull request overview
This PR adds a virtual memory-based allocator to replace the simple torch.empty-based heap allocation in Iris. The implementation uses HIP's vmem APIs to provide better control over GPU memory management and enables DMA-BUF IPC for multi-process memory sharing.
Changes:
- Introduces C++/HIP virtual memory allocator with pybind11 bindings
- Replaces simple heap allocation with sophisticated vmem-based allocation supporting free list reuse
- Adds DMA-BUF file descriptor passing infrastructure for multi-process GPU memory sharing (see the SCM_RIGHTS sketch after this list)
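
For context on that last item, passing file descriptors between processes over Unix domain sockets via SCM_RIGHTS generally follows the pattern below. This is a minimal sketch using the standard `socket` module; `send_fd` and `recv_fd` are illustrative helpers, not the actual API of iris/fd_passing.py:

```python
import array
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one file descriptor over a connected Unix domain socket."""
    fds = array.array("i", [fd])
    # At least one byte of ordinary data must accompany the ancillary data.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds.tobytes())])

def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for cmsg_level, cmsg_type, cmsg_data in ancdata:
        if cmsg_level == socket.SOL_SOCKET and cmsg_type == socket.SCM_RIGHTS:
            fds.frombytes(cmsg_data[: fds.itemsize])
    if not fds:
        raise RuntimeError("No file descriptor received")
    return fds[0]
```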
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| setup.py | Adds CMake build infrastructure for C++ extension |
| pyproject.toml | Adds CMake as build dependency |
| iris/iris.py | Integrates vmem allocator, replaces memory_pool with vmem_allocators, adds DMA-BUF sharing |
| iris/hip.py | Adds DMA-BUF export/import functions (currently unused) |
| iris/gpu_array_wrapper.py | New wrapper class exposing cuda_array_interface for vmem allocations |
| iris/fd_passing.py | New Unix domain socket utilities for file descriptor passing via SCM_RIGHTS |
| csrc/src/bindings.cpp | pybind11 bindings for vmem allocator |
| csrc/include/vmem_allocator.hpp | Core C++ virtual memory allocator implementation |
| csrc/include/gpu_array.hpp | GPU array interface utilities (appears unused) |
| cmake/CPM.cmake | CMake package manager for dependencies |
| .gitignore | Adds build artifacts and socket files to ignore list |
| .devcontainer/devcontainer.json | Changes remote user to root, removes features section |
iris/gpu_array_wrapper.py
Outdated
```python
return {
    'version': 3,
    'shape': self._shape,
    'typestr': self._dtype_map[self._dtype_str],
```
Copilot AI · Jan 19, 2026:
If `_dtype_str` is not in `_dtype_map`, accessing `__cuda_array_interface__` will raise a `KeyError` with no helpful context. Add validation in `__init__` or provide a more informative error message when an unsupported dtype is encountered.
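
A minimal sketch of the suggested validation, assuming a hypothetical `__init__` that takes the pointer, shape, and dtype string and that `_dtype_map` is a class attribute (the signature here is illustrative, not the PR's actual one):

```python
def __init__(self, data_ptr, shape, dtype_str):
    # Validate the dtype up front so unsupported types fail loudly here
    # rather than with a bare KeyError inside __cuda_array_interface__.
    if dtype_str not in self._dtype_map:
        raise ValueError(
            f"Unsupported dtype {dtype_str!r}; supported dtypes: "
            f"{sorted(self._dtype_map)}"
        )
    self._data_ptr = data_ptr
    self._shape = shape
    self._dtype_str = dtype_str
```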
iris/iris.py
Outdated
```python
if peer_offset != offset or peer_size != int(size):
    self.warning(
        f"Offset/size mismatch for peer {peer}: "
        f"local(offset={offset}, size={int(size)}) vs "
        f"peer(offset={peer_offset}, size={peer_size})"
    )

if self.heap_offset + aligned_size > self.heap_size:
    raise MemoryError("Heap out of memory")
# Import peer allocation into our VA at the *peer's* base+offset (remote address model).
peer_base = int(self.heap_bases[peer].item())
target_va = peer_base + peer_offset
self.vmem_allocators[peer].import_dmabuf_at(int(target_va), int(peer_fd), int(peer_size))
```
Copilot AI · Jan 19, 2026:
The offset/size mismatch check logs a warning but continues execution. This could lead to silent corruption in multi-rank scenarios where allocators diverge. Consider making this a critical error that terminates execution, or adding a mechanism to synchronize and recover from allocation-order mismatches.
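
A sketch of the fail-fast variant, reusing the names from the diff above (`peer`, `peer_offset`, `peer_size`); whether to raise or attempt recovery is a design decision for the authors:

```python
if peer_offset != offset or peer_size != int(size):
    # Diverged allocators would silently corrupt remote accesses,
    # so fail fast instead of warning and continuing.
    raise RuntimeError(
        f"Offset/size mismatch for peer {peer}: "
        f"local(offset={offset}, size={int(size)}) vs "
        f"peer(offset={peer_offset}, size={peer_size}); "
        "allocation order has diverged across ranks"
    )
```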
```cpp
// FD is no longer needed after import (avoid leaks).
(void)::close(dmabuf_fd);

hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);

hipMemAccessDesc access_desc;
access_desc.location = alloc_prop_.location;
access_desc.flags = hipMemAccessFlagsProtReadWrite;
hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));

track_allocation(target_va, aligned_size, handle, true);
log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
return target_va;
```
Copilot AI · Jan 19, 2026:
The file descriptor is closed after a successful import. However, if any of the subsequent HIP calls (hipMemMap or hipMemSetAccess) fail, the file descriptor will already be closed but the function will throw an exception, potentially leaking the imported handle. Consider using RAII or deferring the close until all operations succeed.
Suggested change:
```diff
-// FD is no longer needed after import (avoid leaks).
-(void)::close(dmabuf_fd);
-hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
-log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);
-hipMemAccessDesc access_desc;
-access_desc.location = alloc_prop_.location;
-access_desc.flags = hipMemAccessFlagsProtReadWrite;
-hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));
-track_allocation(target_va, aligned_size, handle, true);
-log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
-return target_va;
+// Ensure handle and FD are properly cleaned up on failure.
+try {
+    hip_try(hipMemMap(target_va, aligned_size, 0, handle, 0));
+    log_debug("Mapped imported handle to VA {}, size {}", target_va, aligned_size);
+    hipMemAccessDesc access_desc;
+    access_desc.location = alloc_prop_.location;
+    access_desc.flags = hipMemAccessFlagsProtReadWrite;
+    hip_try(hipMemSetAccess(target_va, aligned_size, &access_desc, 1));
+    // All HIP operations succeeded; FD is no longer needed after import.
+    (void)::close(dmabuf_fd);
+    track_allocation(target_va, aligned_size, handle, true);
+    log_info("Imported dmabuf fd={} to {}, size {}", dmabuf_fd, target_va, aligned_size);
+    return target_va;
+} catch (...) {
+    // Best-effort cleanup of imported handle on error.
+    hipError_t release_err = hipMemRelease(handle);
+    if (release_err != hipSuccess) {
+        log_error("Failed to release imported handle {} after error: {}", (void*)handle,
+                  static_cast<int>(release_err));
+    }
+    // Ensure the FD is closed even if subsequent HIP calls failed.
+    (void)::close(dmabuf_fd);
+    throw;
+}
```
iris/gpu_array_wrapper.py
Outdated
```python
@property
def __cuda_array_interface__(self):
    """CUDA Array Interface v3"""
    return {
        'version': 3,
        'shape': self._shape,
        'typestr': self._dtype_map[self._dtype_str],
        'data': (self._data_ptr, False),  # (ptr, readonly)
        'strides': None,  # C-contiguous
        'descr': [('', self._dtype_map[self._dtype_str])],
    }

@property
```
Copilot AI · Jan 19, 2026:
The `__cuda_array_interface__` property performs dictionary lookups on every access. Since this property may be accessed frequently by PyTorch/NumPy interop code, consider caching the interface dictionary in `__init__` to avoid repeated dictionary construction and lookups.
Suggested change:
```diff
-@property
-def __cuda_array_interface__(self):
-    """CUDA Array Interface v3"""
-    return {
-        'version': 3,
-        'shape': self._shape,
-        'typestr': self._dtype_map[self._dtype_str],
-        'data': (self._data_ptr, False),  # (ptr, readonly)
-        'strides': None,  # C-contiguous
-        'descr': [('', self._dtype_map[self._dtype_str])],
-    }
+# Cache the typestr and full CUDA Array Interface dictionary (in __init__) to avoid
+# repeated dictionary construction and lookups on every property access.
+self._typestr = self._dtype_map[self._dtype_str]
+self._cuda_array_interface = {
+    'version': 3,
+    'shape': self._shape,
+    'typestr': self._typestr,
+    'data': (self._data_ptr, False),  # (ptr, readonly)
+    'strides': None,  # C-contiguous
+    'descr': [('', self._typestr)],
+}
+
+@property
+def __cuda_array_interface__(self):
+    """CUDA Array Interface v3"""
+    return self._cuda_array_interface
 @property
```
```python
if hasattr(torch, "bfloat16"):
    dtype_map[torch.bfloat16] = ("bfloat16", torch.bfloat16)

dtype_str, post_view = dtype_map.get(dtype, ("float32", None))
```
Copilot AI · Jan 19, 2026:
If an unsupported dtype is passed, the code silently defaults to "float32" instead of raising an error. This could lead to silent data corruption or unexpected behavior. Consider raising a ValueError for unsupported dtypes instead of silently falling back to a default.
Suggested change:
```diff
-dtype_str, post_view = dtype_map.get(dtype, ("float32", None))
+if dtype not in dtype_map:
+    raise ValueError(f"Unsupported dtype {dtype!r} for Iris allocation")
+dtype_str, post_view = dtype_map[dtype]
```
iris/iris.py
Outdated
```python
    heap_base = int(self.vmem_allocator.base())
    self.info(f"Initialized vmem allocator with VA base: {hex(heap_base)}")
except ImportError as e:
    raise RuntimeError(f"Failed to import vmem allocator: {e}. Please build the C++ extension.")
```
Copilot AI · Jan 19, 2026:
The error message suggests building the C++ extension, but the build will fail because CMakeLists.txt is missing from the PR. This creates a confusing user experience where following the error message's instruction won't help. The error message should be updated once CMakeLists.txt is added, or it should provide more specific guidance about the build requirements.
Suggested change:
```diff
-raise RuntimeError(f"Failed to import vmem allocator: {e}. Please build the C++ extension.")
+raise RuntimeError(
+    "Failed to import the Iris vmem allocator C++ extension 'iris._iris_vmem': "
+    f"{e}. This usually means the native extension has not been built or is not "
+    "installed in your current environment.\n\n"
+    "If you are developing Iris from source, ensure you have built and installed "
+    'the project with the C++ extensions enabled (for example, from the project '
+    'root run: `pip install -e ".[dev]"`). This requires a valid ROCm/HIP toolchain '
+    "and the native build configuration files (such as CMake-based build scripts) "
+    "to be present. For detailed and up-to-date build instructions, please refer "
+    "to the Iris documentation."
+)
```
```python
# Accept connections from higher ranks
for _ in range(rank + 1, world_size):
    client, _ = listener.accept()
    peer_rank = int.from_bytes(client.recv(4), "little", signed=False)
```
Copilot AI · Jan 19, 2026:
The recv call with a fixed 4-byte buffer could block indefinitely or fail if the peer doesn't send exactly 4 bytes. Consider adding a timeout to prevent hanging, or validate the received data length before attempting to parse it.
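
One possible hardening, sketched with the standard `socket` module; `recv_exact` is a hypothetical helper, not part of this PR:

```python
import socket

def recv_exact(sock: socket.socket, n: int, timeout: float = 10.0) -> bytes:
    """Receive exactly n bytes or raise, instead of trusting a single recv()."""
    sock.settimeout(timeout)  # avoid blocking forever on a stuck peer
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError(f"Peer closed connection after {len(buf)}/{n} bytes")
        buf += chunk
    return buf

# At the accept site:
# peer_rank = int.from_bytes(recv_exact(client, 4), "little", signed=False)
```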
csrc/include/vmem_allocator.hpp
Outdated
```cpp
void add_to_free_list(void* va, std::size_t size) {
    log_debug("Adding {} bytes at {} to free list", size, va);
    free_list_.push_back({va, size});
    // TODO: Could merge adjacent free blocks here for better space utilization
```
Copilot AI · Jan 19, 2026:
The TODO comment indicates that adjacent free blocks are not being merged, which can lead to memory fragmentation. This could prevent successful allocations even when sufficient total free space exists. Implement coalescing of adjacent free blocks to improve memory utilization.
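
For illustration, the coalescing logic could look like the following Python sketch, modeling `free_list_` as a list of `(addr, size)` tuples; the actual implementation would live in the C++ allocator:

```python
def add_to_free_list(free_list, va: int, size: int) -> None:
    """Insert a freed block and merge it with adjacent free blocks."""
    free_list.append((va, size))
    free_list.sort()  # order by address so neighbors sit next to each other
    merged = [free_list[0]]
    for addr, sz in free_list[1:]:
        prev_addr, prev_sz = merged[-1]
        if prev_addr + prev_sz == addr:
            # Adjacent blocks: extend the previous entry instead of keeping two.
            merged[-1] = (prev_addr, prev_sz + sz)
        else:
            merged.append((addr, sz))
    free_list[:] = merged
```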
```python
def export_dmabuf_handle(ptr, size):
    """
    Export a DMA-BUF file descriptor for a memory range (vmem allocations).

    Args:
        ptr: Integer or ctypes pointer to GPU memory
        size: Size of the memory range in bytes

    Returns:
        File descriptor (integer) for the DMA-BUF handle

    Raises:
        RuntimeError: If export fails or backend doesn't support it
    """
    if not _is_amd_backend:
        raise RuntimeError("DMA-BUF export only supported on AMD/HIP backend")

    fd = ctypes.c_int(-1)
    ptr_arg = ctypes.c_void_p(ptr) if isinstance(ptr, int) else ptr

    # hipMemRangeHandleTypeDmaBufFd = 1
    err = gpu_runtime.hipMemGetHandleForAddressRange(
        ctypes.byref(fd), ptr_arg, size, 1, 0
    )

    if err != 0:
        gpu_try(err)  # Will raise with error message

    return fd.value
```
Copilot AI · Jan 19, 2026:
The export_dmabuf_handle function is defined in hip.py but is not used anywhere in the codebase. The actual DMA-BUF export is done through the vmem_allocator's export_dmabuf method. This creates confusion about which API should be used and adds dead code. Consider removing this function or documenting why it exists separately from the vmem_allocator.
Suggested change (delete the function):
```diff
-def export_dmabuf_handle(ptr, size):
-    """
-    Export a DMA-BUF file descriptor for a memory range (vmem allocations).
-    Args:
-        ptr: Integer or ctypes pointer to GPU memory
-        size: Size of the memory range in bytes
-    Returns:
-        File descriptor (integer) for the DMA-BUF handle
-    Raises:
-        RuntimeError: If export fails or backend doesn't support it
-    """
-    if not _is_amd_backend:
-        raise RuntimeError("DMA-BUF export only supported on AMD/HIP backend")
-    fd = ctypes.c_int(-1)
-    ptr_arg = ctypes.c_void_p(ptr) if isinstance(ptr, int) else ptr
-    # hipMemRangeHandleTypeDmaBufFd = 1
-    err = gpu_runtime.hipMemGetHandleForAddressRange(
-        ctypes.byref(fd), ptr_arg, size, 1, 0
-    )
-    if err != 0:
-        gpu_try(err)  # Will raise with error message
-    return fd.value
```
```python
path = all_paths[rank]
try:
    os.unlink(path)
except FileNotFoundError:
```
Copilot AI · Jan 19, 2026:
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:
```diff
-except FileNotFoundError:
+except FileNotFoundError:
+    # The socket path may not exist on first use or after successful cleanup; ignore.
```
|
Needs ROCm/rocm-systems#2667 resolved first. |
Motivation
This PR adds a virtual memory-based allocator to replace the simple torch.empty-based heap allocation in Iris. The implementation uses HIP's vmem APIs to provide better control over GPU memory management and enables DMA-BUF IPC for multi-process memory sharing.
Technical Details
Test Plan
Test Result
Submission Checklist