I have been thinking a bunch about how to do memory allocation. Here is why I think the current memory management system is not good enough and my proposal for how things should work. Please let me know what you think.
Problem
Hermit's current way of managing memory is unsuitable for delivering high-performance allocation: the heap consists of a large &[u8] that is then split up by an allocator library like Talc or galloc. This approach is fundamentally flawed, as it is prone to memory fragmentation.

Hardware gives us virtual memory to tackle fragmentation: non-contiguous frames can be mapped to contiguous ranges of pages to satisfy an allocation request. Practically all general-purpose allocators make use of this, typically indirectly via mmap and madvise. Making good use of this hardware is, to me, one of the main reasons to use a unikernel.

The free lists currently used by Hermit do not support this well: both the physical and virtual memory free lists use a smallvec, which has two downsides:
- It does not scale well to many cores
- It causes them to call the global allocator once they grow beyond a certain size. Currently, this works only because Talc does not modify memory mappings and thus never touches these free lists.
If we change to an allocator that manipulates virtual memory mappings, this will either be impossible or very complex to reason about.
Vision
The virtual address space is divided into four regions (a small sketch follows the list):
- Hardware: the identity map and DeviceAlloc
- Early: memory used for the physical memory allocator, allocated before the virtual address space allocator is in place, and never freed. The virtual address space allocator relies on the physical memory allocator, so we cannot use it yet.
- Managed: managed by the virtual address space allocator
- Unmanaged: Not managed by Hermit. Applications can use this however they see fit. This could be a configurable number of slots in the page table root that is entirely user-managed. This would allow users to have custom paging schemes.
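A minimal sketch of how this split could be expressed, purely for illustration; the type, function, and boundary constants below are made up and do not exist in Hermit:

```rust
/// Illustrative only: the four proposed regions, ordered by address.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum VirtRegion {
    Hardware,  // identity map and DeviceAlloc
    Early,     // mappings created before the virtual allocator exists, never freed
    Managed,   // handed out by the virtual address space allocator
    Unmanaged, // left entirely to the application
}

/// Classify a virtual address. The boundaries are placeholders; picking them
/// (and how many page table root slots Unmanaged gets) is part of this proposal.
fn region_of(addr: usize) -> VirtRegion {
    const HARDWARE_END: usize = 0x0000_2000_0000_0000; // placeholder
    const EARLY_END: usize = 0x0000_2100_0000_0000;    // placeholder
    const MANAGED_END: usize = 0x0000_7000_0000_0000;  // placeholder
    match addr {
        a if a < HARDWARE_END => VirtRegion::Hardware,
        a if a < EARLY_END => VirtRegion::Early,
        a if a < MANAGED_END => VirtRegion::Managed,
        _ => VirtRegion::Unmanaged,
    }
}
```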
Physical Memory Space Allocation
Physical memory is managed via a buddy allocator implemented as a set of bitmaps. This consumes around 2 bits for each frame. Assuming 4 KiB frames, this is an insignificant fraction of available physical memory (about 6.1e-5). I have an existing implementation of such a buddy allocator that I use for an experimental allocator.
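For concreteness, here is a rough sketch of the bookkeeping cost under the 4 KiB-frame assumption; this is an illustration of the scheme, not the existing implementation mentioned above:

```rust
const FRAME_SIZE: usize = 4096; // 4 KiB frames
const MAX_ORDER: usize = 10;    // largest buddy block: 2^10 frames = 4 MiB (placeholder)

/// One bitmap per order: bit `i` of order `k` tracks whether block `i` of
/// 2^k frames is free. The per-order sizes form a geometric series, so the
/// total stays below 2 bits per frame.
fn bitmap_bytes(phys_bytes: usize) -> usize {
    let frames = phys_bytes / FRAME_SIZE;
    (0..=MAX_ORDER)
        .map(|order| (frames >> order).div_ceil(64) * core::mem::size_of::<u64>())
        .sum()
}

// Example: bitmap_bytes(1 << 30) is roughly 64 KiB for 1 GiB of RAM,
// i.e. about 2 / (4096 * 8) ≈ 6.1e-5 of physical memory, as stated above.
```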
To initialize the physical allocator, the total amount of physical memory and the size of the Hardware virtual memory region are computed. Then, the Early region is set up; the physical allocator is initialized before the virtual one, which is why we need the Early region. Frames are mapped in the Early region to form a sufficiently large contiguous allocation for the bitmaps. Any frames required for this process (for the bitmaps themselves and the page table) are naively taken from the start of the available physical memory, possibly spanning multiple memory regions. The remaining physical memory is marked as available in the buddy allocator.
Virtual Address Space Allocation
Virtual memory is also managed by a buddy allocator. I can think of two sensible strategies:
Virtual memory with bitmaps
We could use the same strategy as for physical memory. Assuming a 2MiB allocation granularity for virtual address space, each TiB of address space costs 128KiB of physical memory to manage it as a bitmap. This could be an acceptable tradeoff if the split between the Managed and Unmanaged regions is configurable. Different or configurable allocation granularity may be something to think about.
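As a quick check of that figure (assuming, as for physical memory, roughly 2 bits of buddy bitmap per block):

```rust
const TIB: usize = 1 << 40;
const GRANULE: usize = 2 << 20;                              // 2 MiB granularity
const BLOCKS_PER_TIB: usize = TIB / GRANULE;                 // 524_288 blocks
const BITMAP_BYTES_PER_TIB: usize = BLOCKS_PER_TIB * 2 / 8;  // 131_072 bytes = 128 KiB
```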
Virtual memory with B+-trees
To save memory, virtual address space could be managed using concurrent B+-Trees. The amount of memory used then depends on how fragmented the virtual address space becomes. If it becomes too fragmented, this may consume even more memory than the bitmap, though I hope it does not get nearly that bad in practice. The nodes of the tree are sized such that their size evenly divides one frame. Nodes are allocated from the physical allocator and accessed using the identity maps. This allows dynamic allocation without relying on the global allocator. Synchronization is done using optimistic lock coupling, aka seqlocks. This approach is very scalable and routinely used in databases, see this paper for motivation. Nodes deallocated from the tree are not returned to the physical free list.
Instead, they are kept in a free-list local to the tree. This allows us to use a simple scheme for memory reclamation based on the seqlocks, which is also common practice in databases. Memory reclamation is otherwise a huge headache for concurrent data structures.
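To make the synchronization scheme concrete, here is a minimal sketch of a per-node version lock in the usual optimistic-lock-coupling/seqlock style (the names are made up; this is not Hermit or database code):

```rust
use core::sync::atomic::{AtomicU64, Ordering};

/// Per-node version counter; an odd value means a writer is active.
struct OptLock {
    version: AtomicU64,
}

impl OptLock {
    /// Begin an optimistic read: wait until no writer is active and
    /// remember the version we saw.
    fn read_begin(&self) -> u64 {
        loop {
            let v = self.version.load(Ordering::Acquire);
            if v % 2 == 0 {
                return v;
            }
            core::hint::spin_loop();
        }
    }

    /// Validate an optimistic read: succeeds only if no writer intervened.
    fn read_validate(&self, seen: u64) -> bool {
        self.version.load(Ordering::Acquire) == seen
    }

    /// Acquire the write lock by making the version odd.
    fn write_lock(&self) {
        loop {
            let v = self.version.load(Ordering::Relaxed);
            if v % 2 == 0
                && self
                    .version
                    .compare_exchange_weak(v, v + 1, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                return;
            }
            core::hint::spin_loop();
        }
    }

    /// Release the write lock and publish the changes (version becomes even again).
    fn write_unlock(&self) {
        self.version.fetch_add(1, Ordering::Release);
    }
}
```

A reader that races with a writer, or that reads a node which was meanwhile recycled through the tree-local free list, simply fails validation and retries; that is what makes the simple reclamation scheme sufficient.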
Epoch-based reclamation could be an alternative, though that seems far more complex for relatively little gain. If I recall correctly, Crossbeam's implementation needs an allocator for messages that are used to pass around objects before they are garbage collected, so that is not directly usable. Implementing epoch-based reclamation without an allocator available may turn out to be very difficult, though I have not thought too much about this approach.
The allocator
Once we have facilities for managing physical and virtual memory in place, we can implement a proper allocator.
Something similar to snmalloc, with its threads corresponding to hardware threads, might be a good candidate.
The exact allocator implementation can be decided on in the future.
Until then, we can just allocate a fixed fraction of the physical memory size from both the virtual and physical allocators and map it as a heap for use with Talc or galloc, as is currently the case. Once efficient interfaces for allocating physical and virtual memory are in place, experimentation with different allocators will be much easier.
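A rough sketch of that interim setup (the closure parameters stand in for whatever allocation and mapping interface src/mm ends up exposing; none of this is an existing Hermit API):

```rust
/// Map a fixed fraction of physical memory as one contiguous virtual range
/// and return it as the byte slice that Talc or galloc would manage.
fn init_interim_heap(
    total_phys_bytes: usize,
    mut alloc_virt: impl FnMut(usize) -> usize, // hypothetical: returns a virtual address
    mut alloc_phys: impl FnMut(usize) -> usize, // hypothetical: returns a frame address
    mut map: impl FnMut(usize, usize, usize),   // hypothetical: (virt, phys, len)
) -> &'static mut [u8] {
    const CHUNK: usize = 2 * 1024 * 1024; // back the heap with 2 MiB chunks
    // Reserve a fixed fraction (here: half) of physical memory, rounded to chunks.
    let heap_size = (total_phys_bytes / 2) & !(CHUNK - 1);

    // One contiguous virtual range, backed by possibly non-contiguous frames.
    let virt = alloc_virt(heap_size);
    let mut offset = 0;
    while offset < heap_size {
        map(virt + offset, alloc_phys(CHUNK), CHUNK);
        offset += CHUNK;
    }

    unsafe { core::slice::from_raw_parts_mut(virt as *mut u8, heap_size) }
}
```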
Implementation
This is a lot of work, more than is reasonable to complete in one go.
I recommend splitting it into smaller tasks as follows.
Abstract Memory Allocation
Currently, the free lists are used directly all over the kernel.
Replace these with functions defined in src/mm to allocate and deallocate physical and virtual memory.
For the physical allocation interface, I am not sure yet what the allocation request should be.
At some point, the power-of-two-sized chunks from the buddy allocator should be broken up/combined to serve the requested sizes.
Normal memory can just always be allocated in powers of two (e.g., multiple allocations of 2 MiB to populate some memory mapping).
Device memory, in contrast, may require odd sizes of contiguous physical memory.
Options for what an allocation request could be (one possibility is sketched after the list):
- a core::alloc::Layout
- a naturally aligned power-of-two size (matching the buddy allocator)
- a size chosen from the page sizes of the architecture (i.e., 4 KiB, 2 MiB, or 1 GiB for x86_64)
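One possible shape for the request, purely as a strawman: take a core::alloc::Layout and map it onto the buddy allocator's naturally aligned power-of-two blocks. The helper below is hypothetical and shows only the simple round-up strategy; odd device-memory sizes could instead be served by splitting and recombining blocks, as noted above.

```rust
use core::alloc::Layout;

const FRAME_SIZE: usize = 4096; // 4 KiB frames

/// Hypothetical helper: turn an allocation request into a buddy order,
/// i.e. a naturally aligned power-of-two number of frames.
fn buddy_order_for(layout: Layout) -> u32 {
    // Round both size and alignment up to at least one frame, then up to a
    // power of two, since that is what the buddy allocator hands out.
    let size = layout.size().max(layout.align()).max(FRAME_SIZE);
    (size.next_power_of_two() / FRAME_SIZE).trailing_zeros()
}

// Examples: a 2 MiB request maps to order 9 (512 frames of 4 KiB);
// a 5000-byte device buffer rounds up to order 1 (two frames, 8 KiB).
```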
Physical allocator
Replace the current physical free list with the buddy allocator.
This should be fairly easy to do.
The hardest part of this is probably implementing the allocation of the Early region.
Bitmap Virtual allocator
Replace the virtual allocator with a bitmap-based buddy allocator.
This will be easy, as the bitmap allocator has already been implemented for physical memory.
Implement the allocator
A simple implementation of an allocator should be moderately difficult.
I have implemented memory allocators before, so I think this is doable.
A lot more work could be sunk into tuning this, of course.
Consider B-Tree
Evaluate how the bitmap allocator for virtual memory fares in practice and how bad fragmentation is.
Then, if necessary, switch to a B-Tree-based implementation.
I have implemented B-Trees with support for inlined dynamically sized keys and values before, which was significant work.
I expect that restricting this to integer keys and values will make it much easier.
Still, this will require significant effort, especially in testing to ensure correctness.