Reduce memory arena contention #8714
Conversation
I wonder what happens if you plot the result of your benchmark as a function of thread count. See e.g. this NumPy issue, which reported a similar scaling problem; running the benchmark as a function of thread count was a very useful way both to identify the scaling issue and to confirm it was fixed by using locking that scales better.
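For reference, a minimal sketch of such a plot, assuming per-thread-count timings have already been collected for both branches (the thread counts and timings below are placeholders, not measurements):

```python
# Hypothetical sketch: plot benchmark wall time as a function of thread count.
# The numbers are placeholders; substitute the output of the benchmark script.
import matplotlib.pyplot as plt

threads = [1, 2, 4, 8, 16, 32, 64]                    # thread counts tested
main_times = [1.0, 1.1, 1.4, 2.0, 3.5, 6.8, 13.0]     # placeholder timings, main
branch_times = [1.0, 1.0, 1.1, 1.2, 1.3, 1.5, 1.8]    # placeholder timings, branch

plt.plot(threads, main_times, marker="o", label="main")
plt.plot(threads, branch_times, marker="o", label="branch")
plt.xscale("log", base=2)                             # thread counts double each step
plt.xlabel("threads")
plt.ylabel("wall time (s)")
plt.legend()
plt.savefig("scaling.png")
```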
Great job! The approach looks good to me, though I've left some comments on some specifics.
Force-pushed from cb4b753 to b2a97bb
@lysnikolaou — Thanks for the review! I've applied your suggestions. Please let me know if you see anything else.
Force-pushed from 6479c3b to e2a96a5
@kddnewton Could you please not rebase/force-push after the first round of reviews? It makes follow-up reviews a bit harder.
@lysnikolaou ahh sorry! I saw it was out of date so just wanted to keep it synced. I'll avoid that going forward.
@ngoldbaum here's the chart. Sorry, I didn't end up getting it working with matplotlib, so it's just manually putting the numbers into Google Sheets. But the numbers come straight from the benchmarking script linked in the description. [chart: benchmark results by thread count] You can see from the graph that it gets really bad when you start to have a lot of threads on main.
There's a segfault on Ubuntu that might need some attention here.
Yes, and it appears that the other one is hanging; I'll take a quick look. It was passing before, so I imagine this is related to my changes around looping through the arenas.
@kddnewton Are you going to have a look at the failures here? If not, I can also spend some time on it.
I'm actually curious if we're really getting any benefit from the Arena memory allocator -- since at least in the default case, we're not actually retaining any of the memory for reallocation. We might just be better off using the simpler block allocator. On the other hand, it may work better if we actually retain some of the freed blocks. There's a patch (2401757) in my arrow branch that enables the block allocator for all operations (instead of just being used for ImageTk as it is on main).
Ahh apologies @lysnikolaou — I went on paternity leave a little earlier than expected, so I am out at the moment. I believe @SonicField may be able to look at this soon, but if you have time that would be great. I think the issue is how we're looping through the arenas at the moment, since that changed from the initial version of the PR (I see what you mean about not force-pushing!). Worst case, I'll pick this up when I return at the end of March.
Oh no worries at all. I'll have a look later today. Enjoy your paternity leave!
Oh, unfortunately, it seems like I don't have the necessary permissions to push to this branch.
@lysnikolaou just sent you an invite
from typing import TYPE_CHECKING, Any

from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext
from setuptools.errors import CompileError

if TYPE_CHECKING:
    import distutils.ccompiler
Let's avoid some type-checking only imports:
Current:
from typing import TYPE_CHECKING, Any
from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext
from setuptools.errors import CompileError
if TYPE_CHECKING:
    import distutils.ccompiler

Suggested:
from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext
from setuptools.errors import CompileError
TYPE_CHECKING = False
if TYPE_CHECKING:
    import distutils.ccompiler
    from collections.abc import Iterator
    from typing import Any
import sys
import tempfile
import warnings
from collections.abc import Iterator
Suggested change: remove `from collections.abc import Iterator` here (it is imported under `TYPE_CHECKING` in the suggestion above).
Do we want this for tomorrow's release, assuming the Arrow PR (#8330) is merged?
I think this should be compatible with the Arrow PR. I'm still curious whether the simple block allocator (vs. the arena) is a simpler way to reduce contention on the memory arena, because if it is, it's possible that the memory arena allocator isn't really an advantage anymore.
Hi, sorry that I have let this slip. The original research behind this was mine, and Kevin did an amazing job of working on these PRs. It has been wonderful seeing you folks take it forward. Work on high core count machines indicated the following key concepts:
Point 3 does not mean that memory should never migrate - for example, as an image is processed between multiple threads. However, forcing migration was a very large overhead. This is why I came up with the thread-local memory approach. How does this relate to simple block allocators? My guess is that a block allocator will be as good as an arena in most cases if and only if it maintains thread locality. If we do not maintain thread locality, then we will see multi-process approaches to image processing always dominating over multi-threaded ones. My sense is that the approach outlined here is good, but I have concerns that it might have issues with thread locality. I guess if the number of arenas is high enough, it resolves to the same approach as using thread-local memory. One question that was raised was using jemalloc, as that has thread locality built in. I have not gotten around to trying that - my bad.
If there is no risk of confusion or breaking things, then it could go in along with the Arrow PR; otherwise, wait until more of the questions are answered, or answered well enough that the next step is including it in a release. Also, if there is high demand for this to be included with the Arrow PR, it could go in, but then we'd expect those folks to report on usage and hopefully not encounter any surprises. At a glance, it looks like it could go either way. I feel good about the Arrow PR, but not as good about this one yet - but that's just me. If you want to err on the side of caution, wait for 11.3 (or 12?). If you want to get this out in the wild and start testing (again, without causing confusion or disruption), 11.2.
I don't think there's anything particularly strongly tying this to the Arrow PR, other than that the Arrow PR has a way to avoid using memory arenas at all (bypassing this), and requires that for larger images to be exported via Arrow.
I've taken a bit of a look at this today, and I'm not seeing the same benchmark results. This is with current main merged in, and the Arrow patch tweaked so that the "use block allocator" selector works again (https://github.com/wiredfool/Pillow/tree/memory-arena). For 16, 32, and 64 threads, I'm seeing very similar max and mean values between this branch and main; the only real difference is the min value. (Note: this is an 8-core Intel with 64G of memory.) Running the same tests against the block allocator shows that the block allocator has better mean performance on my system and comparable min. On this branch, 64 threads:
On Main:
This is a follow-up to #8692, based on @wiredfool's feedback.
Previously there was one memory arena for all threads, making it the bottleneck for multi-threaded performance. As the number of threads increased, the contention for the lock on the arena would grow, causing other threads to wait to acquire it.
This commit makes it use 8 memory arenas and round-robins how they are assigned to threads. Each thread keeps track of the index it should use into the arena array, assigned the first time an arena is accessed on that thread.
When an image is first created, it is allocated from an arena. When multiple arenas are enabled, the arena index is stored on the image so that, when the image is deleted, its memory can be returned to the correct arena.
Effectively this means that in single-threaded programs, this should not really have an effect. We also do not do this logic if the GIL is enabled, as it effectively acts as the lock on the default arena for us.
As expected, this approach has no real noticeable effect on regular CPython. On free-threaded CPython, however, there is a massive difference (measuring up to about 70%).
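The change itself lives in Pillow's C allocator; purely as an illustration of the scheme described above, here is a minimal Python sketch. Everything in it (`Arena`, `NUM_ARENAS`, the function names) is invented for the example and is not Pillow's actual API.

```python
# Conceptual sketch only: round-robin assignment of threads to arenas,
# with each block remembering which arena owns it.
import itertools
import threading

NUM_ARENAS = 8


class Arena:
    """Stand-in for a lock-protected memory arena."""

    def __init__(self, index: int) -> None:
        self.index = index
        self.lock = threading.Lock()

    def allocate(self, size: int) -> bytearray:
        with self.lock:  # contention is spread across NUM_ARENAS locks
            return bytearray(size)

    def release(self, block: bytearray) -> None:
        with self.lock:
            pass  # return the block to this arena's free list


arenas = [Arena(i) for i in range(NUM_ARENAS)]
_round_robin = itertools.count()   # shared counter for assigning arena indices
_local = threading.local()         # each thread remembers its arena index


def _thread_arena() -> Arena:
    # Assigned the first time a thread touches an arena, then reused.
    if not hasattr(_local, "index"):
        _local.index = next(_round_robin) % NUM_ARENAS
    return arenas[_local.index]


def new_image_block(size: int) -> tuple[bytearray, int]:
    arena = _thread_arena()
    return arena.allocate(size), arena.index  # the image stores the arena index


def free_image_block(block: bytearray, arena_index: int) -> None:
    arenas[arena_index].release(block)  # freed back to the arena that owns it
```

Because each block records the arena it came from, a free always goes back to the owning arena even when the image is deleted on a different thread, while allocations from different threads mostly contend on different locks.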
Here is the benchmarking script that I used:
test.py
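The script itself is collapsed above. As a rough sketch of the shape of such a thread-scaling benchmark (the operation, image size, task count, and thread counts here are assumptions, not the contents of the actual test.py):

```python
# Rough sketch of a thread-scaling benchmark; not the exact script used in this PR.
import time
from concurrent.futures import ThreadPoolExecutor

from PIL import Image


def work(_: int) -> None:
    # Allocate and resize an image so the memory arena is exercised.
    im = Image.new("RGB", (2048, 2048))
    im.resize((256, 256))


def run(threads: int, tasks: int = 256) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(work, range(tasks)))
    return time.perf_counter() - start


if __name__ == "__main__":
    for threads in (1, 2, 4, 8, 16, 32, 64):
        print(f"{threads} threads: {run(threads):.3f}s")
```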
Results
3.13.0 on main
3.13.0 on branch
3.13.0t on main
3.13.0t on branch