Stack-allocated HPyContext
We discussed this first in our HPy dev call on July 8th, 2021; the discussion hasn't been intense since then but has shown up here and there. We recently had more discussion at our Berlin meetup, so I think it's a good time to summarize everything we have talked about so far.
Since the HPy context is NOT opaque and is crucial for backwards compatibility, it is very important to design it well and to be aware that the design decisions made now will stay with us forever (or at least for a long time). We still make breaking changes to HPy since we still see it in an early phase and we don't have a lot of packages yet that would need to be fixed if we made breaking changes. However, I think this is slowly changing, and we need to agree on the final context structure ASAP.
Right now, HPyContext is a big (generated) structure looking like this (see also: autogen_ctx.h):
struct _HPyContext_s {
const char *name; // used just to make debugging and testing easier
void *_private; // used by implementations to store custom data
int ctx_version;
// roughly 80 built-in handles; more are being added
HPy h_None;
// ...
// roughly 150 context functions; more are being added
HPy (*ctx_Module_Create)(HPyContext *ctx, HPyModuleDef *def);
// ...
};
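To make the access pattern concrete, here is a minimal sketch (the helper functions are made up for illustration; real extension code goes through HPy's API macros, which expand to accesses like these). Under the current flat layout, every context member is a single indirection away from the context pointer:
/* Sketch: single-indirection access under the current flat layout. */
HPy load_none(HPyContext *ctx)
{
    /* one load: *(ctx + offset); real code would HPy_Dup the handle */
    return ctx->h_None;
}
HPy create_module(HPyContext *ctx, HPyModuleDef *def)
{
    return ctx->ctx_Module_Create(ctx, def);   /* load function pointer, call */
}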
HPyContext is already quite large and will grow further since we keep adding functionality to HPy.
As far as I know, all Python implementations supporting HPy currently allocate just one universal context and one debug context in (native) heap memory. For this reason, the size of HPyContext is currently no problem.
However, if we want to allocate HPyContext on the stack, its size becomes a concern.
Why should one want to stack-allocate the context?
First of all, HPy only guarantees that the received HPyContext * (and its contents) is valid for the current call. This is because we wanted to keep the possibility of providing per-call data in the context.
Having this opportunity is IMO very powerful; I will explain that in detail in a later section.
The common way to provide per-call data is to allocate the data structures on the stack for every call. Since calls may be very frequent and are most certainly crucial for performance, we can only have a stack-allocated HPyContext if the structure is reasonably small (let's say, a few words).
The proposed structure for HPyContext, to prepare it for stack allocation, is inspired by JNIEnv (see jni.h). JNIEnv is very minimal and basically just contains a pointer to the function table.
Hence, the idea is to move all members that will be the same for each call into separate data structures:
struct _HPyContext_s {
const struct _HPyFunctionTable_s *fun_table;
const struct _HPyBuiltinHandleTable_s *handles;
void *_private; // used by implementations to store custom data
};
/* information about the context that is rarely used and mainly for debugging purposes */
struct _HPyContextInfo_s {
const char *name;
int ctx_version;
};
/* table of handles to built-in objects */
struct _HPyBuiltinHandleTable_s {
// roughly 80 built-in handles; more are being added
HPy h_None;
// ...
};
/* the context function table */
struct _HPyFunctionTable_s {
// roughly 150 context functions; more are being added
HPy (*ctx_Module_Create)(HPyContext *ctx, HPyModuleDef *def);
// ...
};
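For comparison with the flat layout above, here is a sketch of what the same accesses would look like under the split layout (again, the helper names are made up):
/* Sketch: member access now goes through a table pointer first. */
HPy load_none(HPyContext *ctx)
{
    return ctx->handles->h_None;   /* two loads: table pointer, then handle */
}
HPy create_module(HPyContext *ctx, HPyModuleDef *def)
{
    return ctx->fun_table->ctx_Module_Create(ctx, def);
}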
These are the requirements we have collected so far:
1. Provide call-specific data
2. Provide thread-local data
3. Support sub-interpreters
4. Carry data to be able to do upcall specialization
5. Provide liveness scopes for handles
6. Call C functions of other HPy extensions
7. Fast access (ideally just one indirection) to context members (mostly handles and functions)
8. Low overhead for preparing the HPyContext for a downcall
Referring to the list of requirements, stack-allocated HPyContext structs can fulfill many of them.
Providing call-specific data is easy since we can just allocate a fresh context on the stack for every downcall. In order to ensure that the preparation of the context has low overhead for a downcall, the handle and function tables as well as the context meta info will just be shared.
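A minimal sketch of such a downcall preparation follows (everything except the struct members is hypothetical; the HPyFunc_o typedef follows HPy's signature for functions taking one argument):
/* Sketch: the runtime builds a fresh context on the stack for every
 * downcall; the shared tables are initialized once elsewhere and are
 * merely referenced here. */
typedef HPy (*HPyFunc_o)(HPyContext *ctx, HPy self, HPy arg);

/* hypothetical: defined once by the runtime */
extern const struct _HPyFunctionTable_s global_fun_table;
extern const struct _HPyBuiltinHandleTable_s global_handle_table;

static HPy downcall_o(HPyFunc_o fp, HPy self, HPy arg, void *call_data)
{
    HPyContext ctx = {
        .fun_table = &global_fun_table,     /* shared across all calls */
        .handles   = &global_handle_table,  /* shared across all calls */
        ._private  = call_data,             /* fresh, call-specific    */
    };
    return fp(&ctx, self, arg);
}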
Since the context can already provide call-specific data, that data can also be thread-local.
Sub-interpreters are supported as well.
Handle scopes are also possible: the context may have its own handle table that is used during the downcall.
However, a stack-allocated context performs worse on two points. First, access to context members gets a bit slower due to the additional indirection: in order to get a built-in handle, we now need two memory loads (load the built-in handle table pointer, then load the handle). This could be an unacceptable performance regression. Second, support for upcall specialization is a bit annoying since the called runtime function cannot just use the context's pointer for caching (that pointer will be different for every downcall); the context needs to carry some extra data (maybe a token) for that.
Calling C functions of other HPy extensions is just the same as with every other context.
We can achieve most of the goals with heap-allocated contexts as well. However, that does not happen automatically: we need context caching and the necessary management. The idea is that every time a downcall happens, we fetch a currently unused (heap-allocated) context from some (lock-free) cache and patch it appropriately. The preparation for the downcall will also have low overhead since, as with the stack-allocated context, we reuse all built-in handles, function pointers, and everything but the call-specific data. In order to have thread-local data, we can just have a context cache per thread; the same goes for sub-interpreters. The big advantage of this approach is fast member access, since there is just one indirection starting from the context pointer.
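A sketch of what such a cache could look like (all names are hypothetical; it keeps the flat context layout and uses a per-thread free list instead of a lock-free global cache to keep the example short):
#include <stddef.h>
#include <stdlib.h>

typedef struct CachedCtx {
    struct CachedCtx *next;   /* free-list link */
    HPyContext ctx;           /* flat context, shared members pre-filled */
} CachedCtx;

static _Thread_local CachedCtx *ctx_cache = NULL;

static CachedCtx *init_fresh_ctx(void)
{
    CachedCtx *c = malloc(sizeof *c);
    /* hypothetical: fill c->ctx with the shared built-in handles and
     * function pointers here */
    return c;
}

static HPyContext *acquire_ctx(void *call_data)
{
    CachedCtx *c = ctx_cache;
    if (c != NULL)
        ctx_cache = c->next;       /* reuse a cached context */
    else
        c = init_fresh_ctx();      /* cache miss: build a new one */
    c->ctx._private = call_data;   /* patch only the call-specific slot */
    return &c->ctx;
}

static void release_ctx(HPyContext *ctx)
{
    CachedCtx *c = (CachedCtx *)((char *)ctx - offsetof(CachedCtx, ctx));
    c->next = ctx_cache;           /* return to the per-thread cache */
    ctx_cache = c;
}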
Support for upcall specialization is still a bit annoying since the called runtime function still cannot just use the context's pointer for caching, because we fetch a pre-allocated context from the context cache (not necessarily the same one every time). So we need to carry some extra data (maybe a token) for that as well.
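One conceivable shape for that extra data (purely hypothetical, not part of the proposal above, and applicable to the stack-allocated variant as well) is a token field that stays stable across downcalls, so upcall caches can key on it instead of on the context pointer:
struct _HPyContext_s {
    const struct _HPyFunctionTable_s *fun_table;
    const struct _HPyBuiltinHandleTable_s *handles;
    void *_private;
    void *specialization_token;   /* hypothetical: identical for every downcall
                                     of the same interpreter/thread, usable as
                                     a cache key for upcall specialization */
};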
Stack-allocated context:
- Satisfies requirements 1., 2., 3., 5., and 8. out of the box.
- Major (expected) drawback: the additional indirection (and the implied performance penalty) when accessing built-in handles and functions.
Heap-allocated context with caching:
- Satisfies requirements 1., 2., 3., 5., 7., and 8. (but not out of the box).
- Major drawback: context caching is strongly required, and it may be very complex.
In order to decide whether we should switch to a stack-allocated context (using the proposed structure), we need benchmarks measuring the performance impact of the additional indirection.
My expectation is that since HPyContext is then stack-allocated, and since the stack is mostly in the CPU's L1 cache, the first indirection is very cheap (just a few CPU cycles), whereas it could be much worse for a heap-allocated context.
I expect the second indirection to be about as expensive as accessing the heap-allocated struct, with a reasonable chance of even better performance, since we can now mark the whole built-in handle and function tables as constant, which puts fewer caching restrictions on them.
But that remains to be shown.
Please see also some description/analysis of JNIEnv here:
- The Java Native Interface, Section 11.5 "The JNIEnv Interface Pointer"