Reduce memory arena contention #8714
Conversation
I wonder what happens if you plot the result of your benchmark as a function of thread count. See e.g. this NumPy issue, which reported a similar scaling problem; running the benchmark as a function of thread count was a very useful way both to identify the scaling issue and to confirm it was fixed by using locking that scales better.
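For reference, a minimal sketch of such a plot, assuming per-thread-count timings have already been collected for both branches (the thread counts and timings below are placeholders, not measurements):

```python
# Hypothetical sketch: plot benchmark wall time as a function of thread count.
# The numbers are placeholders; substitute the output of the benchmark script.
import matplotlib.pyplot as plt

threads = [1, 2, 4, 8, 16, 32, 64]                    # thread counts tested
main_times = [1.0, 1.1, 1.4, 2.0, 3.5, 6.8, 13.0]     # placeholder timings, main
branch_times = [1.0, 1.0, 1.1, 1.2, 1.3, 1.5, 1.8]    # placeholder timings, branch

plt.plot(threads, main_times, marker="o", label="main")
plt.plot(threads, branch_times, marker="o", label="branch")
plt.xscale("log", base=2)                             # thread counts double each step
plt.xlabel("threads")
plt.ylabel("wall time (s)")
plt.legend()
plt.savefig("scaling.png")
```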
Great job! The approach looks good to me, though I've left some comments on some specifics.
Force-pushed from cb4b753 to b2a97bb
@lysnikolaou — Thanks for the review! I've applied your suggestions. Please let me know if you see anything else.
Force-pushed from 6479c3b to e2a96a5
@kddnewton Could you please not rebase/force-push after the first round of reviews? It makes follow-up reviews a bit harder.
@lysnikolaou ahh sorry! I saw it was out of date so just wanted to keep it synced. I'll avoid that going forward.
@ngoldbaum here's the chart. Sorry, I didn't end up getting it working with matplotlib, so it's just manually putting the numbers into Google Sheets. But the numbers come straight from the benchmarking script linked in the description. [chart: benchmark results by thread count] You can see from the graph that it gets really bad when you start to have a lot of threads on main.
There's a segfault on Ubuntu that might need some attention here.
Yes, and it appears that the other one is hanging; I'll take a quick look. It was passing before, so I imagine this is related to my changes around looping through the arenas.
@kddnewton Are you going to have a look at the failures here? If not, I can also spend some time on it.
I'm actually curious if we're really getting any benefit from the Arena memory allocator -- since at least in the default case, we're not actually retaining any of the memory for reallocation. We might just be better off using the simpler block allocator. On the other hand, it may work better if we actually retain some of the freed blocks. There's a patch (2401757) in my arrow branch that enables the block allocator for all operations (instead of just being used for ImageTk as it is on main).
Ahh apologies @lysnikolaou — I went on paternity leave a little earlier than expected, so I am out at the moment. I believe @SonicField may be able to look at this soon, but if you have time that would be great. I think the issue is how we're looping through the arenas at the moment, since that changed from the initial version of the PR (I see what you mean about not force-pushing!). Worst case, I'll pick this up when I return at the end of March.
Oh no worries at all. I'll have a look later today. Enjoy your paternity leave!
Oh, unfortunately, it seems like I don't have the necessary permissions to push to this branch.
@lysnikolaou just sent you an invite
from typing import TYPE_CHECKING, Any

from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext
from setuptools.errors import CompileError

if TYPE_CHECKING:
    import distutils.ccompiler
Let's avoid some type-checking only imports:
Current:
from typing import TYPE_CHECKING, Any
from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext
from setuptools.errors import CompileError
if TYPE_CHECKING:
    import distutils.ccompiler

Suggested:
from setuptools import Extension, setup
from setuptools.command.build_ext import build_ext
from setuptools.errors import CompileError
TYPE_CHECKING = False
if TYPE_CHECKING:
    import distutils.ccompiler
    from collections.abc import Iterator
    from typing import Any
import sys
import tempfile
import warnings
from collections.abc import Iterator
Suggested change: remove `from collections.abc import Iterator` here (it is imported under `TYPE_CHECKING` in the suggestion above).
Do we want this for tomorrow's release, assuming the Arrow PR (#8330) is merged?
I think this should be compatible with the Arrow PR. I'm still curious whether the simple block allocator (vs. the arena) is a simpler way to reduce contention on the memory arena, because if it is, it's possible that the memory arena allocator isn't really an advantage anymore.
Hi, sorry that I have let this slip. The original research behind this was mine, and Kevin did an amazing job of working on these PRs. It has been wonderful seeing you folks take it forward. Work on high core count machines indicated the following key concepts:
Point 3 does not mean that memory should never migrate - for example, as an image is processed between multiple threads. However, forcing migration was a very large overhead. This is why I came up with the thread-local memory approach. How does this relate to simple block allocators? My guess is that a block allocator will be as good as an arena in most cases if and only if it maintains thread locality. If we do not maintain thread locality, then we will see multi-process approaches to image processing always dominating over multi-threaded ones. My sense is that the approach outlined here is good, but I have concerns that it might have issues with thread locality. I guess if the number of arenas is high enough, it resolves to the same approach as using thread-local memory. One question that was raised was using jemalloc, as that has thread locality built in. I have not gotten around to trying that - my bad.
If there is no risk of confusion or breaking things, then it could go in along with the Arrow PR; otherwise, wait until more of the questions are answered, or answered well enough that the next step is including it in a release. Also, if there is high demand for this to be included with the Arrow PR, it could go in, but then we'd expect those folks to report on usage and hopefully not encounter any surprises. At a glance, it looks like it could go either way. I feel good about the Arrow PR, but not as good about this one yet - but that's just me. If you want to err on the side of caution, wait for 11.3 (or 12?). If you want to get this out in the wild and start testing (again, without causing confusion or disruption), 11.2.
I don't think there's anything particularly strongly tying this to the Arrow PR, other than that the Arrow PR has a way to avoid using memory arenas at all (bypassing this), and requires that for larger images to be exported via Arrow.
I've taken a bit of a look at this today, and I'm not seeing the same benchmark results. This is with current main merged in, and the Arrow patch tweaked so that the "use block allocator" selector works again (https://github.com/wiredfool/Pillow/tree/memory-arena). For 16, 32, and 64 threads, I'm seeing very similar max and mean values between this branch and main; the only real difference is the min value. (Note: this is an 8-core Intel with 64G of memory.) Running the same tests against the block allocator shows that the block allocator has better mean performance on my system and comparable min. On this branch, 64 threads:
On Main:
This is a follow-up to #8692, based on @wiredfool's feedback.
Previously there was one memory arena for all threads, making it the bottleneck for multi-threaded performance. As the number of threads increased, the contention for the lock on the arena would grow, causing other threads to wait to acquire it.
This commit makes it use 8 memory arenas and round-robins how they are assigned to threads. Each thread keeps track of the index it should use into the arena array, assigned the first time an arena is accessed on that thread.
When an image is first created, it is allocated from an arena. When multiple arenas are enabled, the arena index is stored on the image so that, when the image is deleted, its memory can be returned to the correct arena.
Effectively this means that in single-threaded programs, this should not really have an effect. We also do not do this logic if the GIL is enabled, as it effectively acts as the lock on the default arena for us.
As expected, this approach has no real noticeable effect on regular CPython. On free-threaded CPython, however, there is a massive difference (measuring up to about 70%).
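The change itself lives in Pillow's C allocator; purely as an illustration of the scheme described above, here is a minimal Python sketch. Everything in it (`Arena`, `NUM_ARENAS`, the function names) is invented for the example and is not Pillow's actual API.

```python
# Conceptual sketch only: round-robin assignment of threads to arenas,
# with each block remembering which arena owns it.
import itertools
import threading

NUM_ARENAS = 8


class Arena:
    """Stand-in for a lock-protected memory arena."""

    def __init__(self, index: int) -> None:
        self.index = index
        self.lock = threading.Lock()

    def allocate(self, size: int) -> bytearray:
        with self.lock:  # contention is spread across NUM_ARENAS locks
            return bytearray(size)

    def release(self, block: bytearray) -> None:
        with self.lock:
            pass  # return the block to this arena's free list


arenas = [Arena(i) for i in range(NUM_ARENAS)]
_round_robin = itertools.count()   # shared counter for assigning arena indices
_local = threading.local()         # each thread remembers its arena index


def _thread_arena() -> Arena:
    # Assigned the first time a thread touches an arena, then reused.
    if not hasattr(_local, "index"):
        _local.index = next(_round_robin) % NUM_ARENAS
    return arenas[_local.index]


def new_image_block(size: int) -> tuple[bytearray, int]:
    arena = _thread_arena()
    return arena.allocate(size), arena.index  # the image stores the arena index


def free_image_block(block: bytearray, arena_index: int) -> None:
    arenas[arena_index].release(block)  # freed back to the arena that owns it
```

Because each block records the arena it came from, a free always goes back to the owning arena even when the image is deleted on a different thread, while allocations from different threads mostly contend on different locks.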
Here is the benchmarking script that I used:
test.py
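The script itself is collapsed above. As a rough sketch of the shape of such a thread-scaling benchmark (the operation, image size, task count, and thread counts here are assumptions, not the contents of the actual test.py):

```python
# Rough sketch of a thread-scaling benchmark; not the exact script used in this PR.
import time
from concurrent.futures import ThreadPoolExecutor

from PIL import Image


def work(_: int) -> None:
    # Allocate and resize an image so the memory arena is exercised.
    im = Image.new("RGB", (2048, 2048))
    im.resize((256, 256))


def run(threads: int, tasks: int = 256) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(work, range(tasks)))
    return time.perf_counter() - start


if __name__ == "__main__":
    for threads in (1, 2, 4, 8, 16, 32, 64):
        print(f"{threads} threads: {run(threads):.3f}s")
```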
Results
3.13.0 on main
3.13.0 on branch
3.13.0t on main
3.13.0t on branch