Support for mmap across machines? #16
On second thought, I realize there also needs to be a Compact Region building API that takes a designated mmap'ed region as the target storage space, instead of malloc-on-demand or something similar. Is this feasible as well?
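For reference, a minimal sketch of how a region gets built with the existing GHC.Compact API (GHC 8.2+). Note that neither compact nor compactAdd takes a target address or a caller-supplied buffer, which is exactly the gap being asked about here:

```haskell
import GHC.Compact (Compact, compact, compactAdd, compactSize, getCompact)

-- Build a region from fully evaluated values; the RTS allocates the backing
-- blocks itself, and there is no way to point it at an mmap'ed buffer.
main :: IO ()
main = do
  region  <- compact ([1 .. 1000] :: [Int])        -- new region, value copied in
  region' <- compactAdd region "an appended value" -- append into the same region
  print (getCompact region')                       -- read the latest root back out
  compactSize region' >>= print                    -- region size in bytes
```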
The pointers will never automatically relocate; that would require GHC to generate different code to process compact-region pointers, and the point is that you don't have to recompile anything. You'll have to map the memory region at exactly the same addresses everywhere.
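To make the "same address everywhere" requirement concrete, here is a rough sketch of mapping a shared file at a fixed address with MAP_FIXED via the FFI. The protection/flag constants are the Linux values and are assumptions here; real code should take them from `<sys/mman.h>` on the target platform and handle errors more carefully:

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Bits ((.|.))
import Foreign.C.Types (CInt (..), CSize (..))
import Foreign.Ptr (Ptr, intPtrToPtr)
import System.Posix.Types (COff (..), Fd (..))

foreign import ccall unsafe "sys/mman.h mmap"
  c_mmap :: Ptr () -> CSize -> CInt -> CInt -> CInt -> COff -> IO (Ptr ())

-- Assumed Linux values for PROT_READ, MAP_SHARED and MAP_FIXED.
protRead, mapShared, mapFixed :: CInt
protRead  = 0x1
mapShared = 0x01
mapFixed  = 0x10

-- Map `len` bytes of an already-opened file descriptor at exactly `addr`.
-- Every machine has to pick the same `addr` and have that range free,
-- which is the restrictive part being discussed.
mapRegionAt :: Ptr () -> CSize -> Fd -> IO (Ptr ())
mapRegionAt addr len (Fd fd) = do
  p <- c_mmap addr len protRead (mapShared .|. mapFixed) fd 0
  if p == intPtrToPtr (-1)  -- MAP_FAILED
    then ioError (userError "mmap failed")
    else pure p
```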
I get it, thanks. Then I'm not aware of which API I can use to build a Compact Region at a specified address (within an mmap'ed region); does such an API already exist?
Oh, I forgot about some internal details of our implementation. Since compact regions have to live in honest-to-goodness GHC blocks in the memory manager, what you want may be somewhat difficult to actually do; at least, it's not supported out of the box here.
I'm not very familiar with the internals of GHC; does the memory manager have some sort of extension mechanism that would let me mmap a region and persuade the allocator to use it? I'd think that with heavier use of this, a parallel Haskell implementation might perform much better on workloads with large immutable datasets as shared input.
I have a motivating use case favoring direct mmap over serialization/deserialization, where a Compact Region tends to be reused by many machines, from multiple parallel processes, and even across multiple process lifecycles over time.
It's a single heap of cyclic data structures (which can be made immutable at a light ergonomic cost) that I propose to store as a Compact Region and that is actively scanned with heavy parallelism. A consuming node machine will maintain some fixed number of concurrent processes performing arbitrary jobs; a process will exit after finishing some jobs, and another process will be created to carry on with more jobs. Some jobs may share the same heap of data, so it's highly desirable that such a heap be cached in OS kernel pages automatically.
This is pretty straightforward with a virtual file (e.g. driven by a FUSE filesystem) that is mmap'ed with its contents fetched on demand; mmap'ing a physical file on shared storage (e.g. mounted via NFS) would work similarly and is easier to implement, though less flexible.
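A tiny sketch of the shared-storage variant, assuming the mmap package's System.IO.MMap interface (the path is illustrative, and the exact argument shape should be checked against the package docs):

```haskell
import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)

-- Map a file that lives on an NFS mount; pages are only faulted in from the
-- server when the bytes are actually inspected, so the kernel page cache ends
-- up sharing the data between processes automatically.
main :: IO ()
main = do
  bytes <- mmapFileByteString "/mnt/shared/dataset.compact" Nothing  -- Nothing = whole file
  print (BS.length bytes)
```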
I suppose a Compact Region can be read right away if I manage to have it mmap'ed at the same address from another machine, but that is overly restrictive. I wonder whether pointers within a Compact Region are already relocation-aware and would work as expected, or how much work would be needed to achieve that?
And if code changes are needed, can they be done in a library separate from stock GHC?
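Regarding the relocation question above: as far as I understand, the region's payload stores absolute addresses, so loading it at a different base would require a fixup pass rewriting every interior pointer by the delta between the old and new base. The helper below is purely illustrative (it is not an existing GHC API); the hard part in practice is that only the RTS knows which words of a heap object are pointers rather than raw data:

```haskell
import Data.Word (Word64)

-- p' = p - oldBase + newBase, applied only to words that are known to be
-- pointers into the region; anything else must be left untouched.
relocateWord :: Word64 -> Word64 -> Word64 -> Word64 -> Word64
relocateWord oldBase newBase regionSize p
  | p >= oldBase && p < oldBase + regionSize = p - oldBase + newBase
  | otherwise                                = p
```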