Add per-interpreter storage for gil_safe_call_once_and_store
#5933
base: master
Conversation
include/pybind11/detail/internals.h
// FIXME: We cannot use `get_num_interpreters_seen() > 1` here to create a fast path for
// the multi-interpreter case. The singleton may be initialized by a subinterpreter, not
// the main interpreter.
//
// With multi-interpreter support, subinterpreters can be initialized concurrently, and
// the first call to this function may not happen in the main interpreter. For example,
// a clean main interpreter that does not import any pybind11 module may spawn multiple
// subinterpreters using `InterpreterPoolExecutor`, each of which imports a pybind11
// module concurrently.
if (get_num_interpreters_seen() > 1) {
cc @b-pass @rwgk for thoughts about this; see the referenced code comment for more details.
I tested the patch in this PR in my downstream PR metaopt/optree#245.

Most things work fine except test_import_in_subinterpreter_before_main:

run_in_subprocess(
    """
    with contextlib.closing(interpreters.create()) as subinterpreter:
        subinterpreter.exec('import optree')

    import optree
    """
)

If I remove the get_num_interpreters_seen() > 1 condition, my import test works, but the cpptest in the pybind11 CI breaks because internals_singleton_pp_ is never initialized. For instance, the memory address returned by get_internals() should be a shared constant across different pybind11 modules in a single program.
The get_num_interpreters_seen() > 1 check is meant to keep the "there are no subinterpreters" case (the most common case) as fast as possible.

Can you help me understand the problem that concurrent subinterpreter imports are causing?

I'm not understanding what exactly the problem is... Concurrent imports of pybind11 modules, each in its own subinterpreter, should still result in each subinterpreter getting its own internals (as it should). The one that gets to this code first populates internals_singleton_pp_, and it does not have to be the main interpreter; the next time get_pp is called from that first subinterpreter, it takes the other (> 1) code path. Since get_num_interpreters_seen() never decreases (it counts interpreters "seen", not "currently alive"), once it reaches 2 the inner code path is always used, and whatever is stored in internals_singleton_pp_ no longer matters for the rest of the program.
What's the purpose of get_pp_for_main_interpreter() and the associated once flag?
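To make that reasoning concrete, here is a minimal standalone sketch of a monotonically increasing "seen" counter guarding a single-interpreter fast path. The names and structure are illustrative only, not pybind11's actual implementation.

#include <atomic>
#include <cassert>

// The counter only ever increases ("seen", not "currently alive"), so once the
// process has ever had two interpreters, the multi-interpreter path is taken forever.
std::atomic<unsigned> num_interpreters_seen{0};

void on_interpreter_created() { num_interpreters_seen.fetch_add(1, std::memory_order_relaxed); }

bool use_single_interpreter_fast_path() {
    return num_interpreters_seen.load(std::memory_order_acquire) <= 1;
}

int main() {
    on_interpreter_created();                    // main interpreter starts up
    assert(use_single_interpreter_fast_path());  // fast path is valid
    on_interpreter_created();                    // first subinterpreter is created
    assert(!use_single_interpreter_fast_path()); // fast path stays invalid for the rest of the process
}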
Let me elaborate more on my failing test. I have a C++ singleton that uses gil_safe_call_once_and_store to ensure exactly one instance per interpreter. With the patch in this PR, the call-once result is stored in the internals (*get_pp()), and the internals are stored in the interpreter's state dict.
MyClass &get_singleton() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<MyClass> storage;
    return storage
        .call_once_and_store_result([]() -> MyClass {
            MyClass instance{};
            {
                // initialize
                ...
            }
            return instance;
        })
        .get_stored();
}

- OK - test_import_in_subinterpreters_concurrently: import the module in multiple subinterpreters concurrently, without any import of a pybind11 module in the main interpreter.
- OK - test_import_in_subinterpreter_after_main: import the module in the main interpreter, then import it in a subinterpreter.
- FAIL - test_import_in_subinterpreter_before_main: import the module in a subinterpreter (the main interpreter does not import any pybind11 module yet), then import the module in the main interpreter.

The import succeeds in the subinterpreter, but the instance is ill-initialized in the main interpreter. If I remove the get_num_interpreters_seen() > 1 check entirely, or initialize the atomic counter of seen interpreters with 2 instead of 0, my test passes.
What's the purpose of get_pp_for_main_interpreter() and the associated once flag?
Just to ensure the singleton is initialized once. I'm not sure if the scoped GIL still works correctly when there are multiple interpreters running. I can revert the call_once approach back to the scoped-GIL approach.
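For reference, a minimal sketch of the call_once-style guard being discussed; the function name mirrors the thread above, but the body is illustrative only and is not pybind11's actual code.

#include <mutex>

namespace sketch {

struct internals {};

// Initialize the process-wide pointer-to-pointer exactly once, no matter which
// thread (or interpreter) gets here first; std::call_once does not rely on the GIL.
internals **get_pp_for_main_interpreter() {
    static internals **pp = nullptr;
    static std::once_flag flag;
    std::call_once(flag, [] { pp = new internals *(nullptr); });
    return pp;
}

} // namespace sketch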
I have updated the approach to move the storage map into a separate capsule in the per-interpreter state dict.

The CI tests are looking good so far. But I found some concurrency bugs for this patch in my downstream PR metaopt/optree#245. I'm not sure yet whether they come from this patch or from pre-existing issues. I will do further investigation about these.

Bug 1: Duplicate native enum registration.

ImportError: pybind11::native_enum<...>("PyTreeKind") is already registered!

Test Case:

def test_import_in_subinterpreter_before_main():
    # OK
    check_script_in_subprocess(
        textwrap.dedent(
            """
            import contextlib
            import gc
            from concurrent import interpreters

            subinterpreter = None
            with contextlib.closing(interpreters.create()) as subinterpreter:
                subinterpreter.exec('import optree')

            import optree

            del optree, subinterpreter
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )

    # FAIL
    check_script_in_subprocess(
        textwrap.dedent(
            f"""
            import contextlib
            import gc
            import random
            from concurrent import interpreters

            subinterpreter = subinterpreters = stack = None
            with contextlib.ExitStack() as stack:
                subinterpreters = [
                    stack.enter_context(contextlib.closing(interpreters.create()))
                    for _ in range({NUM_FUTURES})
                ]
                random.shuffle(subinterpreters)
                for subinterpreter in subinterpreters:
                    subinterpreter.exec('import optree')

            import optree

            del optree, subinterpreter, subinterpreters, stack
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )

    # FAIL
    check_script_in_subprocess(
        textwrap.dedent(
            f"""
            import contextlib
            import gc
            import random
            from concurrent import interpreters

            subinterpreter = subinterpreters = stack = None
            with contextlib.ExitStack() as stack:
                subinterpreters = [
                    stack.enter_context(contextlib.closing(interpreters.create()))
                    for _ in range({NUM_FUTURES})
                ]
                random.shuffle(subinterpreters)
                for subinterpreter in subinterpreters:
                    subinterpreter.exec('import optree')

            import optree

            del optree, subinterpreter, subinterpreters, stack
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )

Traceback:

__________________________________________________________________ test_import_in_subinterpreter_before_main ___________________________________________________________________
def test_import_in_subinterpreter_before_main():
    check_script_in_subprocess(
        textwrap.dedent(
            """
            import contextlib
            import gc
            from concurrent import interpreters

            subinterpreter = None
            with contextlib.closing(interpreters.create()) as subinterpreter:
                subinterpreter.exec('import optree')

            import optree

            del optree, subinterpreter
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )
>   check_script_in_subprocess(
        textwrap.dedent(
            f"""
            import contextlib
            import gc
            import random
            from concurrent import interpreters

            subinterpreter = subinterpreters = stack = None
            with contextlib.ExitStack() as stack:
                subinterpreters = [
                    stack.enter_context(contextlib.closing(interpreters.create()))
                    for _ in range({NUM_FUTURES})
                ]
                random.shuffle(subinterpreters)
                for subinterpreter in subinterpreters:
                    subinterpreter.exec('import optree')

            import optree

            del optree, subinterpreter, subinterpreters, stack
            for _ in range(10):
                gc.collect()
            """,
        ).strip(),
        output='',
        rerun=NUM_FLAKY_RERUNS,
    )

concurrent/test_subinterpreters.py:267:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

script = "import contextlib\nimport gc\nimport random\nfrom concurrent import interpreters\n\nsubinterpreter = subinterpreters ...optree')\n\nimport optree\n\ndel optree, subinterpreter, subinterpreters, stack\nfor _ in range(10):\n    gc.collect()"

def check_script_in_subprocess(script, /, *, output, env=None, cwd=TEST_ROOT, rerun=1):
    result = ''
    for _ in range(rerun):
        try:
            result = subprocess.check_output(
                [sys.executable, '-Walways', '-Werror', '-c', script],
                stderr=subprocess.STDOUT,
                text=True,
                encoding='utf-8',
                cwd=cwd,
                env={
                    key: value
                    for key, value in (env if env is not None else os.environ).items()
                    if not key.startswith(('PYTHON', 'PYTEST', 'COV_'))
                },
            )
        except subprocess.CalledProcessError as ex:
>           raise CalledProcessError(ex.returncode, ex.cmd, ex.output, ex.stderr) from None
E           helpers.CalledProcessError: Command '['/home/PanXuehai/Projects/optree/venv/bin/python3', '-Walways', '-Werror', '-c', "import contextlib\nimport gc\nimport random\nfrom concurrent import interpreters\n\nsubinterpreter = subinterpreters = stack = None\nwith contextlib.ExitStack() as stack:\n    subinterpreters = [\n        stack.enter_context(contextlib.closing(interpreters.create()))\n        for _ in range(16)\n    ]\n    random.shuffle(subinterpreters)\n    for subinterpreter in subinterpreters:\n        subinterpreter.exec('import optree')\n\nimport optree\n\ndel optree, subinterpreter, subinterpreters, stack\nfor _ in range(10):\n    gc.collect()"]' returned non-zero exit status 1.
E           Output:
E           Traceback (most recent call last):
E             File "<string>", line 16, in <module>
E               import optree
E             File "/home/PanXuehai/Projects/optree/optree/__init__.py", line 17, in <module>
E               from optree import accessors, dataclasses, functools, integrations, pytree, treespec, typing
E             File "/home/PanXuehai/Projects/optree/optree/accessors.py", line 25, in <module>
E               import optree._C as _C
E             ImportError: pybind11::native_enum<...>("PyTreeKind") is already registered!

_      = 0
cwd    = PosixPath('/home/PanXuehai/Projects/optree/tests')
env    = None
output = ''
rerun  = 8
result = ''
script = "import contextlib\nimport gc\nimport random\nfrom concurrent import interpreters\n\nsubinterpreter = subinterpreters ...optree')\n\nimport optree\n\ndel optree, subinterpreter, subinterpreters, stack\nfor _ in range(10):\n    gc.collect()"

helpers.py:187: CalledProcessError

Bug 2: Thread-safety issues once multiple interpreters have ever been created.
Looks good to me. I haven't had time to read the Cursor outputs in detail for the subinterpreter test, and I haven't so far been able to get a 3.14.2t install functioning to reproduce the problem, but I should be able to sort that out soon. So I'll continue to work on that failing test in #5940.
I added the three tests I mentioned in #5933 (comment), and they failed in CI. I can disable them for now if they are out of the scope of this PR.
// This is only valid if there is a single interpreter. Otherwise, it is not used.
// WARNING: We cannot use a thread-local cache similar to `internals_pp_manager::internals_p_tls`
//          because the thread-local storage cannot be explicitly invalidated when interpreters
//          are destroyed (unlike `internals_pp_manager`, which has explicit hooks for that).
If internals_pp_manager can do it, why can't we?
Should the map be a member of local_internals so we can benefit from the caching and finalization logic that already exists there? We can add things to local internals without bumping the internals version.
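One common way to make a thread-local cache safely invalidatable is a global generation counter that interpreter-destruction hooks bump, so stale entries are discarded lazily. The sketch below illustrates that pattern only; it does not claim this is how internals_pp_manager actually works, and all names are made up.

#include <atomic>
#include <cstdint>

namespace sketch {

struct internals {};

// Bumped by an interpreter-destruction hook; makes every thread's cached pointer
// stale without having to touch other threads' TLS directly.
std::atomic<std::uint64_t> interp_generation{0};

thread_local internals *cached_internals = nullptr;
thread_local std::uint64_t cached_generation = 0;

internals *lookup_internals_slow() {
    static internals fallback; // stand-in for a real per-interpreter lookup (state dict, etc.)
    return &fallback;
}

internals *get_internals_cached() {
    const std::uint64_t gen = interp_generation.load(std::memory_order_acquire);
    if (cached_internals == nullptr || cached_generation != gen) {
        cached_internals = lookup_internals_slow(); // refresh the empty or stale cache
        cached_generation = gen;
    }
    return cached_internals;
}

void on_interpreter_destroyed() { interp_generation.fetch_add(1, std::memory_order_release); }

} // namespace sketch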
I think there are still some bugs with internals_pp_manager. internals and local_internals are not reliable for now; there is still some work to be done to ensure they work correctly in the multi-interpreter case. So I added this warning comment, and maybe we can improve this in the future. The first priority for this PR is to make things work right.

For example, in the comment #5933 (comment) and the CI failure https://github.com/pybind/pybind11/actions/runs/20414341520?pr=5933, py::native_enum<...> complains during module import that the enum type is already registered. However, the registration map is stored in internals and local_internals, which should be per-interpreter.
pybind11/include/pybind11/native_enum.h, lines 31 to 40 at 3aeb113:

if (detail::get_local_type_info(typeid(EnumType)) != nullptr
    || detail::get_global_type_info(typeid(EnumType)) != nullptr) {
    pybind11_fail(
        "pybind11::native_enum<...>(\"" + enum_name_encoded
        + "\") is already registered as a `pybind11::enum_` or `pybind11::class_`!");
}
if (detail::global_internals_native_enum_type_map_contains(enum_type_index)) {
    pybind11_fail("pybind11::native_enum<...>(\"" + enum_name_encoded
                  + "\") is already registered!");
}
This indicates there exist bugs in internals_pp_manager. The bugs are not related to gil_safe_call_once_and_store.

I suspect that the unref and destroy hooks for internals_pp_manager only work for interpreters created via the pybind11::subinterpreter API, and that the TLS storage becomes malformed when interpreters are created/destroyed from the Python side. For example, when interpreters come from concurrent.interpreters or concurrent.futures.InterpreterPoolExecutor, internals_pp_manager is not aware that an interpreter has been destroyed on the Python side.
cc @b-pass
I suspect that the unref and destroy hooks for internal_pp_manager only work for interpreters created by the pybind11::subinterpreter API. The tls storage will be malformed when the interpreters are created/destroyed from the Python side.
That does seem likely. Can we fix it for everyone, instead of working around it for this specific case while leaving a problem in other areas? I believe everything would work out if we do per-interpreter cleanup from the internals capsule destructor, since clearing the interpreter state dict is one of the last actions on interpreter shutdown.
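For illustration only, here is a hedged sketch of that suggestion, assuming the cleanup hangs off a PyCapsule destructor stored in the interpreter's state dict (all identifiers below are made up, and error handling is omitted). Because the state dict is cleared near the end of interpreter shutdown, the destructor would run for every interpreter, no matter how it was created.

#include <Python.h>

namespace sketch {

struct per_interpreter_data { /* per-interpreter maps, caches, ... */ };

constexpr const char *kCapsuleKey = "__example_per_interpreter_data__"; // hypothetical key

void destroy_capsule(PyObject *capsule) {
    // Runs when the interpreter state dict drops its reference, i.e. late in
    // Py_EndInterpreter()/Py_FinalizeEx(), whether the interpreter was created from
    // C++ or from Python (concurrent.interpreters, InterpreterPoolExecutor, ...).
    delete static_cast<per_interpreter_data *>(PyCapsule_GetPointer(capsule, kCapsuleKey));
}

// Must be called with the GIL (or the per-interpreter lock) held.
per_interpreter_data *get_per_interpreter_data() {
    PyObject *state_dict = PyInterpreterState_GetDict(PyInterpreterState_Get());
    if (PyObject *existing = PyDict_GetItemString(state_dict, kCapsuleKey)) { // borrowed reference
        return static_cast<per_interpreter_data *>(PyCapsule_GetPointer(existing, kCapsuleKey));
    }
    auto *data = new per_interpreter_data();
    PyObject *capsule = PyCapsule_New(data, kCapsuleKey, &destroy_capsule);
    PyDict_SetItemString(state_dict, kCapsuleKey, capsule); // the dict now owns a strong reference
    Py_DECREF(capsule);
    return data;
}

} // namespace sketch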
This commit improves the C++ test infrastructure to ensure test output is visible in CI logs, and disables a test that hangs on free-threaded Python 3.14+.

Changes:

## CI/test infrastructure improvements

- .github/workflows: Added `timeout-minutes: 3` to all C++ test steps to prevent indefinite hangs.
- tests/**/CMakeLists.txt: Added `USES_TERMINAL` to C++ test targets (cpptest, test_cross_module_rtti, test_pure_cpp) to ensure output is shown immediately rather than buffered and possibly lost on crash/timeout.
- tests/test_with_catch/catch.cpp: Added a custom Catch2 progress reporter with timestamps, Python version info, and a SIGTERM handler to make test execution and failures clearly visible in CI logs.

## Disabled hanging test

- The "Move Subinterpreter" test is disabled on free-threaded Python 3.14+ due to a hang in Py_EndInterpreter() when the subinterpreter is destroyed from a different thread than it was created on. Work on fixing the underlying issue will continue under PR pybind#5940.

Context: We were in the dark for months (since we started testing with Python 3.14t) because CI logs gave no clue about the root cause of hangs. This led to ignoring intermittent hangs (mostly on macOS). Our hand was forced only with the Python 3.14.1 release, when hangs became predictable on all platforms.

For the full development history of these changes, see PR pybind#5933.
After seeing @oremanj's comments, I decided to break out #5942, so we get rid of the 90-minute hangs asap and keep this PR focused. I'll merge #5942 when I see that the CI is green there. @XuehaiPan there will be some conflicts, because I added a couple of commits under #5942 that are not in this PR (mainly to report SKIPPING the Move Interpreter test). To resolve them, simply git checkout the affected files from master.
I haven't merged yet; the CI for 5942 is still running. This PR currently overlaps with 5942. But after 5942 is merged into master, and you update this PR by merging master, the overlap will be gone, and this PR will be much more focused (smaller) again.
…p hanging Move Subinterpreter test (#5942)

* Improve C++ test infrastructure and disable hanging test

This commit improves the C++ test infrastructure to ensure test output is visible in CI logs, and disables a test that hangs on free-threaded Python 3.14+.

Changes:

## CI/test infrastructure improvements

- .github/workflows: Added `timeout-minutes: 3` to all C++ test steps to prevent indefinite hangs.
- tests/**/CMakeLists.txt: Added `USES_TERMINAL` to C++ test targets (cpptest, test_cross_module_rtti, test_pure_cpp) to ensure output is shown immediately rather than buffered and possibly lost on crash/timeout.
- tests/test_with_catch/catch.cpp: Added a custom Catch2 progress reporter with timestamps, Python version info, and a SIGTERM handler to make test execution and failures clearly visible in CI logs.

## Disabled hanging test

- The "Move Subinterpreter" test is disabled on free-threaded Python 3.14+ due to a hang in Py_EndInterpreter() when the subinterpreter is destroyed from a different thread than it was created on. Work on fixing the underlying issue will continue under PR #5940.

Context: We were in the dark for months (since we started testing with Python 3.14t) because CI logs gave no clue about the root cause of hangs. This led to ignoring intermittent hangs (mostly on macOS). Our hand was forced only with the Python 3.14.1 release, when hangs became predictable on all platforms.

For the full development history of these changes, see PR #5933.

* Add test summary to progress reporter

Print the total number of test cases and assertions at the end of the test run, making it easy to spot if tests are disabled or added.

Example output:

[ PASSED ] 20 test cases, 1589 assertions.

* Add PYBIND11_CATCH2_SKIP_IF macro to skip tests at runtime

Catch2 v2 doesn't have native skip support (v3 does with SKIP()). This macro allows tests to be skipped with a visible message while still appearing in the test list. Use this for the Move Subinterpreter test on free-threaded Python 3.14+ so it shows as skipped rather than being conditionally compiled out.

Example output:

[ RUN ] Move Subinterpreter
[ SKIPPED ] Skipped on free-threaded Python 3.14+ (see PR #5940)
[ OK ] Move Subinterpreter

* Fix clang-tidy bugprone-macro-parentheses warning in PYBIND11_CATCH2_SKIP_IF
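As context for the reporter described above, here is a minimal sketch of a Catch2 v2 event listener that prints per-test progress lines. It is illustrative only, not the actual tests/test_with_catch/catch.cpp code, and the include path is an assumption.

#define CATCH_CONFIG_MAIN
#include "catch.hpp" // path to the vendored Catch2 v2 single header may differ

#include <iostream>

namespace {

struct ProgressListener : Catch::TestEventListenerBase {
    using TestEventListenerBase::TestEventListenerBase;

    void testCaseStarting(Catch::TestCaseInfo const &info) override {
        // Flushed immediately so CI logs show which test was running if the run hangs.
        std::cout << "[ RUN      ] " << info.name << std::endl;
    }

    void testCaseEnded(Catch::TestCaseStats const &stats) override {
        std::cout << (stats.totals.assertions.allPassed() ? "[       OK ] " : "[  FAILED  ] ")
                  << stats.testInfo.name << std::endl;
    }
};

} // namespace

CATCH_REGISTER_LISTENER(ProgressListener)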
@XuehaiPan I just merged #5942 (the CI was green). There is now a conflict showing here, as expected. Could you please merge master into this PR to resolve it?
Quick note: I just removed the "piggy-backed" section from the PR description. |
This reverts commit 534235e.
Description
Fixes #5926
For !defined(PYBIND11_HAS_SUBINTERPRETER_SUPPORT), the previous behavior is unchanged.

For defined(PYBIND11_HAS_SUBINTERPRETER_SUPPORT), the pybind11 internals is already a per-interpreter singleton stored in the interpreter's state dict. This PR adds, to the interpreter-dependent internals, a new map from the gil_safe_call_once_and_store object's address to its stored result; the gil_safe_call_once_and_store is a static object with a fixed address.

UPDATE: The storage map is moved to a separate capsule in the per-interpreter state dict instead of being a member of internals. Because the per-interpreter internals are leaked for the lifetime of the program, the storage map would otherwise never be GC-ed on (sub-)interpreter shutdown.
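As a rough illustration of the mechanism (hypothetical names; the capsule lookup through the interpreter state dict is reduced to a stand-in here), the per-interpreter map is keyed by the address of the static call-once object, so the same call site produces one result per interpreter rather than one per process:

#include <memory>
#include <mutex>
#include <unordered_map>

namespace sketch {

// Hypothetical per-interpreter storage; in this PR it lives behind a capsule in the
// interpreter's state dict, so each interpreter has its own instance.
struct per_interpreter_storage {
    std::mutex mutex;
    std::unordered_map<const void *, std::shared_ptr<void>> call_once_results;
};

per_interpreter_storage &storage_for_current_interpreter() {
    static per_interpreter_storage storage; // stand-in: a real version resolves this per interpreter
    return storage;
}

// `slot` is the address of a static object owned by the call site (e.g. the
// gil_safe_call_once_and_store instance), which is fixed for the whole process.
template <typename T, typename Fn>
T &call_once_per_interpreter(const void *slot, Fn &&make) {
    auto &storage = storage_for_current_interpreter();
    std::lock_guard<std::mutex> lock(storage.mutex);
    auto &entry = storage.call_once_results[slot];
    if (!entry) {
        entry = std::make_shared<T>(make()); // first call in this interpreter: build and cache
    }
    return *static_cast<T *>(entry.get());
}

} // namespace sketch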
Suggested changelog entry:

Add per-interpreter storage for gil_safe_call_once_and_store to make it safe under multiple interpreters.