memmgr performance bug caused by high number of free vmas #2033

@adrianlut

Description of the problem

While benchmarking the Hyrise DBMS inside Gramine, a student and I found the following performance issue: Hyrise regularly allocates a lot of memory with small mmap calls and then calls munmap on them. This creates a lot of free vmas in the memmgr.h free list (possibly millions).

The problem occurs due to the CHECK_LIST_HEAD macro in list.h, which is called in both free_mem_obj_to_mgr() and get_mem_obj_from_mgr_enlarge() (i.e. on most interactions with the memmgr). It traverses the whole free list to check its correctness while the memmgr is locked, blocking the whole memory system for other threads. If the free list contains millions of entries and calls to mmap are common, this prevents any useful work from happening.

From looking at the CHECK_LIST_HEAD macro, I think this is not intended to happen in release mode since the asserts in the loop are then replaced with (void)0. However, our performance tests show that the issue occurs both in debug and release mode. I guess that although the loop does not contain useful work, it is not removed by the compiler.
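To illustrate the suspected mechanism, here is a minimal sketch of a debug-style list check. The `struct node` layout and the `check_list` and `push_front` helpers are hypothetical and are not Gramine's actual `list.h` code. The point is only that the traversal itself costs time linear in the list length, even when the `assert` inside the loop expands to `((void)0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for illustration; not Gramine's list.h layout. */
struct node {
    struct node* prev;
    struct node* next;
};

/* Debug-style consistency check: walks every node and asserts that
 * neighbouring links agree. The return value (nodes visited) exists only
 * to make the linear cost visible in this sketch. */
static size_t check_list(const struct node* first) {
    size_t visited = 0;
    for (const struct node* pos = first; pos != NULL; pos = pos->next) {
        /* With NDEBUG defined, assert() expands to ((void)0), but the loop
         * header still walks the list unless the compiler removes it. */
        assert(pos->next == NULL || pos->next->prev == pos);
        visited++;
    }
    return visited;
}

/* Pushes n at the front of the list and returns the new first node. */
static struct node* push_front(struct node* first, struct node* n) {
    n->prev = NULL;
    n->next = first;
    if (first != NULL)
        first->prev = n;
    return n;
}
```

If such a check runs on every allocation and free while a lock is held, each memmgr operation becomes proportional to the free list length. Whether a given compiler eliminates the empty loop in release builds depends on the optimization level and on what it can prove about the list, so this would need to be confirmed in the generated code.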

Steps to reproduce

(I will follow up with code to reproduce the issue in the coming days if necessary)

  1. Compile and use Gramine from master
  2. Compile and use Hyrise TPC-H benchmark from master
  3. Run benchmark with scale factor 5, scheduler activated, 8 threads, and 8 simulated clients without Gramine
  4. Run benchmark with the same settings in Gramine

Alternative: Write a micro-benchmark

  1. Call mmap and munmap and measure the required time as baseline
  2. Fill the memmgr free list with $10^6$ items (call mmap $10^6$ times and then unmap all mapped memory)
  3. Call mmap and munmap and measure their time with a long free list.

I guess the thread-local vma cache will probably interfere with such a simple benchmark design, but the goal should be clear.
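The micro-benchmark steps above could be sketched roughly as follows. This is not the promised reproduction code. The helper names `measure_pair_ns` and `fill_free_list`, the one-page mapping size, and the assumed 4096-byte page size are all illustrative choices. It also assumes that unmapping many small anonymous mappings is enough to populate the memmgr free list, which the thread-local vma cache may prevent.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and clock_gettime on glibc */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

#define PAGE_SIZE_BYTES 4096 /* assumed page size */

/* Monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Steps 1 and 3: average latency in ns of one mmap+munmap pair. */
static uint64_t measure_pair_ns(size_t iters) {
    if (iters == 0)
        return 0;
    uint64_t start = now_ns();
    for (size_t i = 0; i < iters; i++) {
        void* p = mmap(NULL, PAGE_SIZE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(p != MAP_FAILED);
        munmap(p, PAGE_SIZE_BYTES);
    }
    return (now_ns() - start) / iters;
}

/* Step 2: create `count` one-page mappings, then unmap all of them.
 * Returns the number of mappings that were successfully created. */
static size_t fill_free_list(size_t count) {
    void** maps = malloc(count * sizeof(*maps));
    if (maps == NULL)
        return 0;
    size_t mapped = 0;
    for (; mapped < count; mapped++) {
        maps[mapped] = mmap(NULL, PAGE_SIZE_BYTES, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (maps[mapped] == MAP_FAILED)
            break;
    }
    for (size_t i = 0; i < mapped; i++)
        munmap(maps[i], PAGE_SIZE_BYTES);
    free(maps);
    return mapped;
}
```

A driver would call `measure_pair_ns` once for the baseline, then `fill_free_list(1000000)`, then `measure_pair_ns` again, and compare the two averages both natively and inside Gramine.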

Expected results

Throughput of the Hyrise benchmark when running in Gramine is more than 50% of the throughput when running without Gramine.

Alternative micro-benchmark result: latency of mmap and munmap is independent of free list length.

Actual results

Throughput of Hyrise when running in Gramine is approximately 3% of the throughput when running without Gramine. According to the perf functionality included in Gramine debug builds, 80% of the runtime is spent in memory management/bookkeeping functions.

Alternative micro-benchmark result: latency of mmap and munmap depends on free list length.

Gramine commit hash

91c90b4
