Support for mmap across machines? #16

complyue · 2020-08-12T07:44:30Z

I have a motivating use case that favors direct mmap over serialization/deserialization: a Compact Region that is reused by many machines, by multiple parallel processes, and even across multiple generations of processes over time.

The data is a single heap of cyclic data structures (it can be made immutable at a small ergonomic cost), which I propose to store as a Compact Region and which is scanned with heavy parallelism. Each consuming node maintains a fixed number of concurrent processes performing arbitrary jobs; a process exits after finishing some jobs, and another process is then created to carry on with more. Some jobs share the same heap of data, so it's highly desirable for such a heap to be cached automatically in the OS kernel's page cache.

This is pretty straightforward with a virtual file (e.g. backed by a FUSE filesystem) that is mmap'ed with its content fetched on demand. Alternatively, mmap'ing a physical file on shared storage (e.g. mounted via NFS) works similarly; that is easier to implement but less flexible.
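To make the file-backed idea concrete, here is a minimal C sketch of mapping a file read-only with `MAP_SHARED`, so that every process on the machine mapping the same file is served from the same kernel page cache. The helper name `map_file_readonly` is hypothetical, and nothing here involves GHC or Compact Regions; it only shows the plain POSIX side of the proposal.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a whole file read-only.
   MAP_SHARED means all processes mapping this file share the kernel's
   cached pages for it. Returns NULL on failure (including an empty
   file); on success stores the mapped length in *len_out. */
static const void *map_file_readonly(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

Pages are faulted in lazily on first access, which is what gives the fetch-on-demand behaviour when the file lives on FUSE or NFS. Note that the kernel chooses the mapping address here, which is exactly the placement question raised in the next paragraph.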

I suppose a Compact Region can be read right away if I manage to have it mmap'ed at the same address on another machine, but that is overly restrictive. I wonder whether pointers within a Compact Region are already relocation-aware and would work as expected, or, if not, how much work is needed to achieve that?

And if code changes are needed, can they be done in a library separate from stock GHC?

complyue · 2020-08-12T08:16:17Z

On second thought, I realize this also needs a Compact Region building API that takes a designated mmap'ed region as the target storage space, instead of malloc-on-demand or something similar. Is this feasible as well?

ezyang · 2020-08-12T16:40:49Z

The pointers will never automatically relocate: that would require GHC to generate different code to process compact region pointers, and the whole point is that you don't have to recompile anything. You'll have to map the memory region at exactly the same address everywhere.
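The same-address requirement can be illustrated without GHC at all. The C sketch below is an assumption-laden toy, not how the RTS lays out a compact region: `cell`, `map_region`, `build_cycle`, `inside`, and `copy_points_back_to_original` are all hypothetical names. It builds a small cyclic structure with absolute pointers inside one mapping, copies the raw bytes into a second mapping at a different address, and checks where the copied pointers lead.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define REGION_LEN 4096

/* A cell holding an absolute pointer, standing in for a heap object. */
typedef struct cell {
    struct cell *next;
    int value;
} cell;

/* Map anonymous memory, optionally requesting a specific address.
   Without MAP_FIXED the address is only a hint, so return NULL if the
   kernel placed the mapping somewhere else. */
static void *map_region(void *wanted) {
    void *p = mmap(wanted, REGION_LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;
    if (wanted != NULL && p != wanted) {
        munmap(p, REGION_LEN);
        return NULL;
    }
    return p;
}

/* Build a two-cell cycle at the start of the region. */
static void build_cycle(void *region) {
    cell *a = (cell *)region;
    cell *b = a + 1;
    a->value = 1; a->next = b;
    b->value = 2; b->next = a;
}

/* Does ptr fall inside [region, region + REGION_LEN)? */
static int inside(const void *region, const void *ptr) {
    uintptr_t lo = (uintptr_t)region, p = (uintptr_t)ptr;
    return p >= lo && p < lo + REGION_LEN;
}

/* Returns 1 if a byte-for-byte copy placed at a different address still
   holds pointers into the original region, 0 if not, -1 on mmap failure. */
static int copy_points_back_to_original(void) {
    void *orig = map_region(NULL);
    void *copy = map_region(NULL);
    int stale = -1;
    if (orig != NULL && copy != NULL) {
        build_cycle(orig);
        memcpy(copy, orig, REGION_LEN);
        cell *root = (cell *)copy;
        stale = inside(orig, root->next) && !inside(copy, root->next);
    }
    if (copy != NULL) munmap(copy, REGION_LEN);
    if (orig != NULL) munmap(orig, REGION_LEN);
    return stale;
}
```

The copied root's `next` field still refers to the original mapping, so a reader that only has the copy would follow a dangling pointer. Passing the original address to `map_region` on the consuming side is the toy equivalent of mapping the region at the same address everywhere; because the address is only a hint, the caller has to be prepared for that request to fail.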

complyue · 2020-08-13T06:43:58Z

I get it, thanks. But I'm not aware of an API I can use to build a Compact Region at a specified address (within an mmap'ed region). Does such an API already exist?

ezyang · 2020-08-14T03:02:26Z

Oh, I forgot about some internal details of our implementation. Since compact regions have to live in honest-to-goodness GHC blocks in the memory manager, what you want may be somewhat difficult to actually do; at least, it's not supported out of the box here.

complyue · 2020-08-14T08:01:22Z

I'm not very familiar with the internals of GHC. Does the memory manager have some sort of extension mechanism that would let me mmap a region and persuade the allocator to use it?

I'd think that with heavier use of this, a parallel Haskell implementation could perform much better on workloads with large immutable datasets as shared input.
