Conversation

@hanno-becker (Contributor) commented Dec 14, 2025

#1389 plus consolidation of multiple allocations into workspaces.

@hanno-becker force-pushed the alloc_free_config branch 10 times, most recently from f103112 to 93d235d on December 15, 2025 05:40
@hanno-becker marked this pull request as ready for review December 15, 2025 05:41
@hanno-becker requested a review from a team as a code owner December 15, 2025 05:41
@hanno-becker marked this pull request as draft December 15, 2025 05:41
@hanno-becker (Contributor Author)

@davidchisnall Since this will be relevant to your use case, you may want to have a look. It's understood that you'll need a context parameter as well, but this can be a follow-up PR.

@hanno-becker marked this pull request as ready for review December 15, 2025 05:54
@hanno-becker marked this pull request as draft December 15, 2025 05:57
@hanno-becker (Contributor Author) commented Dec 15, 2025

Still needs work, since CBMC can't yet handle the consolidation into workspaces. Done in #1388.

Strictly speaking, the introduction of workspace structs could be deferred, since for now we test the custom heap-based allocation with C11 only, and it's not part of the main source. Trialed this in #1389.

@davidchisnall

I don't like the goto cleanup pattern, because it isn't something I can use for other kinds of error handling. That might be something that would be possible to fix in a later commit but, for example, we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow). We'd also like to be able to use the same mechanism to report that getting entropy failed (currently it's assumed to succeed, which means that the system fails open).

There is also a tradeoff in consolidating the allocations. Fewer calls to the allocator will be better for performance, but larger allocations are more likely to fail. I put the performance / stack usage numbers in this post for the Verilator simulator of the slowest CHERIoT core. The signing and verification steps are 4-5M cycles, so a few hundred extra cycles for heap allocation is in the noise, but failing heap allocation is a problem.
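
For reference, a minimal sketch of the goto-cleanup pattern under discussion, with separate heap allocations per buffer. All identifiers and sizes are illustrative, not the actual mlkem-native code:

```c
#include <stdlib.h>

/* Two temporary buffers, one shared cleanup label. */
int example_operation(void)
{
    int ret = -1;
    unsigned char *buf_a = NULL;
    unsigned char *buf_b = NULL;

    buf_a = malloc(1024);
    if (buf_a == NULL)
    {
        goto cleanup;
    }

    buf_b = malloc(2048);
    if (buf_b == NULL)
    {
        goto cleanup;
    }

    /* ... actual work using buf_a and buf_b ... */
    ret = 0;

cleanup:
    /* free(NULL) is a no-op, so a partially completed setup is handled too */
    free(buf_b);
    free(buf_a);
    return ret;
}
```

The concern raised above is that this shape only covers failures that funnel into the cleanup label; longjmp-style unwinding on stack overflow needs its own mechanism.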

@hanno-becker (Contributor Author)

> We'd also like to be able to use the same mechanism to report that getting entropy failed (currently it's assumed to succeed, which means that the system fails open).

Why would the same mechanism not work for entropy failure? I was hoping that we could similarly set an error code (which is currently an unrefined -1 but which I hope to split up in v2) and jump to cleanup upon RNG failure.

What would your suggestion be?

@davidchisnall

> Why would the same mechanism not work for entropy failure? I was hoping that we could similarly set an error code (which is currently an unrefined -1 but which I hope to split up in v2) and jump to cleanup upon RNG failure.

Yes, that would work. Currently the get-entropy function has a void return and assumes success. If allocation is in the same function as an entropy call, then it could also goto the same cleanup point.
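
A sketch of what that could look like, with a fallible entropy call taking the same exit path as a failed allocation. The mlk_get_entropy name, its return convention, and the sizes are assumptions for illustration only:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical entropy hook; a real implementation would query the OS or a TRNG.
 * Returns 0 on success, non-zero on failure. */
static int mlk_get_entropy(unsigned char *out, size_t len)
{
    memset(out, 0, len); /* placeholder only */
    return 0;
}

int example_keygen(void)
{
    int ret = -1;
    unsigned char seed[32];
    unsigned char *ws = malloc(4096);

    if (ws == NULL)
    {
        goto cleanup;
    }

    /* Entropy failure reuses the same cleanup label as allocation failure */
    if (mlk_get_entropy(seed, sizeof(seed)) != 0)
    {
        goto cleanup;
    }

    /* ... key generation using ws and seed ... */
    ret = 0;

cleanup:
    free(ws);
    return ret;
}
```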

@hanno-becker (Contributor Author)

> we'd like running out of stack space to do a longjmp and free all heap allocations that were created on the way (we have some SEH-like error handling that handles stack unwinding with some on-stack jump buffers, which is triggered automatically on stack overflow)

Does this need any co-operation from the application/library?

If not, it seems the current goto cleanup approach might be OK, if we also use it for other errors such as RNG failure.

@oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: 7d16cd7 | Previous: 71773d9 | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 38968 cycles | 38914 cycles | 1.00 |
| ML-KEM-512 encaps | 44887 cycles | 44730 cycles | 1.00 |
| ML-KEM-512 decaps | 56701 cycles | 56823 cycles | 1.00 |
| ML-KEM-768 keypair | 64330 cycles | 65610 cycles | 0.98 |
| ML-KEM-768 encaps | 72031 cycles | 72813 cycles | 0.99 |
| ML-KEM-768 decaps | 87999 cycles | 87757 cycles | 1.00 |
| ML-KEM-1024 keypair | 96164 cycles | 96052 cycles | 1.00 |
| ML-KEM-1024 encaps | 106319 cycles | 106405 cycles | 1.00 |
| ML-KEM-1024 decaps | 126938 cycles | 126654 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: 7d16cd7 | Previous: 71773d9 | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 36515 cycles | 36187 cycles | 1.01 |
| ML-KEM-512 encaps | 43046 cycles | 42619 cycles | 1.01 |
| ML-KEM-512 decaps | 56148 cycles | 55403 cycles | 1.01 |
| ML-KEM-768 keypair | 59998 cycles | 59471 cycles | 1.01 |
| ML-KEM-768 encaps | 68176 cycles | 67728 cycles | 1.01 |
| ML-KEM-768 decaps | 85747 cycles | 84838 cycles | 1.01 |
| ML-KEM-1024 keypair | 88749 cycles | 88148 cycles | 1.01 |
| ML-KEM-1024 encaps | 99071 cycles | 98356 cycles | 1.01 |
| ML-KEM-1024 decaps | 121586 cycles | 120110 cycles | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: 7d16cd7 | Previous: 71773d9 | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 59383 cycles | 59377 cycles | 1.00 |
| ML-KEM-512 encaps | 68529 cycles | 68765 cycles | 1.00 |
| ML-KEM-512 decaps | 88650 cycles | 87486 cycles | 1.01 |
| ML-KEM-768 keypair | 99659 cycles | 99614 cycles | 1.00 |
| ML-KEM-768 encaps | 111486 cycles | 111493 cycles | 1.00 |
| ML-KEM-768 decaps | 136633 cycles | 136258 cycles | 1.00 |
| ML-KEM-1024 keypair | 149176 cycles | 149294 cycles | 1.00 |
| ML-KEM-1024 encaps | 164998 cycles | 165110 cycles | 1.00 |
| ML-KEM-1024 decaps | 196336 cycles | 196272 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks

| Benchmark suite | Current: 7d16cd7 | Previous: 71773d9 | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 50706 cycles | 50856 cycles | 1.00 |
| ML-KEM-512 encaps | 58961 cycles | 58653 cycles | 1.01 |
| ML-KEM-512 decaps | 74965 cycles | 75093 cycles | 1.00 |
| ML-KEM-768 keypair | 88105 cycles | 87534 cycles | 1.01 |
| ML-KEM-768 encaps | 95599 cycles | 95539 cycles | 1.00 |
| ML-KEM-768 decaps | 118572 cycles | 119501 cycles | 0.99 |
| ML-KEM-1024 keypair | 130239 cycles | 130561 cycles | 1.00 |
| ML-KEM-1024 encaps | 142448 cycles | 142169 cycles | 1.00 |
| ML-KEM-1024 decaps | 173815 cycles | 173715 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker changed the title from "Make allocation of large structures/buffers configurable" to "Make allocation of large structures/buffers configurable (with workspaces)" on Dec 16, 2025
@oqs-bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks

| Benchmark suite | Current: 7d16cd7 | Previous: 71773d9 | Ratio |
| --- | --- | --- | --- |
| ML-KEM-512 keypair | 155119 cycles | 155354 cycles | 1.00 |
| ML-KEM-512 encaps | 163151 cycles | 163232 cycles | 1.00 |
| ML-KEM-512 decaps | 206149 cycles | 206517 cycles | 1.00 |
| ML-KEM-768 keypair | 261039 cycles | 261154 cycles | 1.00 |
| ML-KEM-768 encaps | 275418 cycles | 275821 cycles | 1.00 |
| ML-KEM-768 decaps | 338056 cycles | 337878 cycles | 1.00 |
| ML-KEM-1024 keypair | 395001 cycles | 395312 cycles | 1.00 |
| ML-KEM-1024 encaps | 421934 cycles | 423303 cycles | 1.00 |
| ML-KEM-1024 decaps | 505554 cycles | 507323 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker (Contributor Author) commented Dec 16, 2025

I changed the approach to use an explicit error check in the code rather than expecting MLK_CUSTOM_ALLOC to do this. This makes it easier to switch between stack and heap allocation without running into C90 errors because of mixed code and declarations. It also gives us control over the error code, while the user-provided MLK_CUSTOM_ALLOC can follow the standard convention that failure is reported through a NULL return value.

Also, this is now based on #1389 which doesn't introduce workspaces.
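
A sketch of the resulting shape, assuming MLK_CUSTOM_ALLOC reports failure via a NULL return as described above. The MLK_CUSTOM_FREE name, the workspace type, and the error code are illustrative assumptions, not the actual mlkem-native definitions:

```c
#include <stdlib.h>

/* Defaults for this sketch only; a user would provide their own hooks. */
#ifndef MLK_CUSTOM_ALLOC
#define MLK_CUSTOM_ALLOC(size) malloc(size)
#define MLK_CUSTOM_FREE(ptr)   free(ptr)
#endif

typedef struct
{
    unsigned char scratch[4096]; /* stand-in for the real per-function buffers */
} example_ws;

int example_fn(void)
{
    int ret = -1;
    /* Declaration with initializer keeps this valid C90 (no mixed declarations and code). */
    example_ws *ws = (example_ws *)MLK_CUSTOM_ALLOC(sizeof(example_ws));

    if (ws == NULL)
    {
        return -1; /* error code chosen by the library, not by the allocation hook */
    }

    /* ... work using ws->scratch ... */
    ret = 0;

    MLK_CUSTOM_FREE(ws);
    return ret;
}
```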

@hanno-becker force-pushed the alloc_free_config branch 3 times, most recently from a876fb7 to 7bbba13 on December 21, 2025 19:20
@hanno-becker changed the title from "Make allocation of large structures/buffers configurable (with workspaces)" to "Bundle multiple allocations into workspaces" on Dec 21, 2025
@mkannwischer (Contributor) left a comment

Thanks @hanno-becker, one small nit on the code; otherwise this looks good to me.

Two points remain to be discussed:

  1. The previous approach would allow a user to allocate different buffers in different memories; e.g., one could allocate the large matrix A on the heap while all other buffers stay on the stack. This is no longer possible with the workspace approach. This may be particularly relevant for embedded systems (see the sketch below).
  2. As @davidchisnall noted here, allocating everything at once is more likely to fail, which is undesirable.

Any thoughts @davidchisnall, @gilles-peskine-arm, @hanno-becker?
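
To illustrate point 1, a sketch of the kind of split the previous per-buffer approach permitted; names and sizes are placeholders, not the actual mlkem-native code:

```c
#include <stdlib.h>

int example_fragmented(void)
{
    int ret = -1;
    unsigned char small_scratch[256];                 /* small buffer stays on the stack */
    unsigned char *matrix_a = malloc(4u * 4u * 512u); /* only the large matrix A goes to the heap */

    if (matrix_a == NULL)
    {
        return -1;
    }

    /* ... expand A into matrix_a, use small_scratch for intermediate values ... */
    (void)small_scratch;
    ret = 0;

    free(matrix_a);
    return ret;
}
```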

@hanno-becker force-pushed the alloc_free_config branch 2 times, most recently from 2c78165 to 793b43d on December 29, 2025 19:25
@hanno-becker (Contributor Author) commented Dec 30, 2025

Unless @davidchisnall or @gilles-peskine-arm have concrete use cases in mind where we know that the monolithic allocation will cause problems, I would suggest adding workspaces for now and adjusting things on demand. This is internal and can be changed at any time. Especially for mldsa-native, I am hoping that this will also help us reduce the CBMC overhead from dynamic allocation, which is currently blocking proofs.

@gilles-peskine-arm

I very much prefer the monolithic allocation approach, unless there is an identified use case for fragmented allocation or monolithic allocation is too hard to implement.

Let's consider the three typical types of memory that might be used.

  • Stack: if everything is on the stack, then monolithic or fragmented doesn't matter if the monolithic workspace is implemented optimally, meaning that it doesn't use more space than the stack would. E.g. a monolithic workspace can waste memory if it ends up being struct {a; b; c;} and you know that only two of a, b and c can ever be active at the same time (see the sketch after this comment). I would guess that this doesn't happen with crypto that is constant-time, because that tends to happen when you have code paths that depend on the data. But I'm not familiar enough with MLKEM and MLDSA to know if it's a problem.
  • Heap: a monolithic allocation saves a bit of RAM overhead, but can be problematic if your heap is already fragmented. I think that if fragmenting the allocations is the only way your application can run, you're already too tight on memory and even a small change in the order of operations, or just letting your application run a little longer, would cause heap exhaustion even with smaller fragments.
  • Global: if I have to reserve some global area for a costly operation, I'll just designate one area (or maybe one area per CPU core). I can't see any advantage for fragmentation.

There may be a reason to use different types of memory for some data. If a system has a small amount of memory that's better protected against snooping, that memory should be prioritized for key-equivalent secret data over message-specific secret data during signing. But I think if you want that level of control, you should be prepared to maintain your own platform-specific cryptography code, or at least your own patches. It's not a level of control that we intend to provide in TF-PSA-Crypto (whereas we would like to give a choice between stack and a user-provided buffer).

In addition, I would prefer to avoid having low-level functions that can fail: ideally some high-level function would reserve the necessary amount of memory (and generate the random data), and then all low-level functions can't fail. One benefit of that is that it reduces the risk of doing error handling wrong, although having a CBMC proof mostly mitigates that risk. Another benefit is the code size: the cost of all if (failed) goto cleanup adds up.
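
On the struct {a; b; c;} point above: if two of the members are never live at the same time, the workspace can overlay them in a union so the monolithic layout does not exceed the peak usage of separate allocations. A sketch with purely illustrative names and sizes:

```c
typedef struct
{
    unsigned char a[3072];     /* live across the whole operation */
    union
    {
        unsigned char b[2048]; /* used only in phase 1 */
        unsigned char c[2048]; /* used only in phase 2; reuses b's storage */
    } bc;
} example_workspace;
```

Whether such non-overlapping lifetimes actually occur in the ML-KEM/ML-DSA code paths is the open question raised above.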

@mkannwischer (Contributor)

> I very much prefer the monolithic allocation approach, unless there is an identified use case for fragmented allocation or monolithic allocation is too hard to implement. […]

Thanks @gilles-peskine-arm for this extensive comment. Makes sense to me.
In that case, we'll go ahead and merge the workspaces in mlkem-native now and also try to get that to work in mldsa-native.

@mkannwischer (Contributor) left a comment

Thanks @hanno-becker. LGTM.

@hanno-becker force-pushed the alloc_free_config branch 2 times, most recently from b0b7a86 to c37d1ea on January 1, 2026 17:34
This commit introduces 'workspace structures', which are
function-local structures wrapping all objects and buffers
utilized by the function into a single struct.

The benefit of the workspace struct is a simpler allocation
boilerplate: Instead of allocating multiple structures and
separately handling their success/failure, we now only need
to allocate a single object per function.

Signed-off-by: Hanno Becker <[email protected]>
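
A minimal sketch of the shape this commit message describes; the type and field names are placeholders, not the actual mlkem-native definitions:

```c
/* Before: poly_a, poly_b and seed were allocated separately, each with its
 * own failure check. After: one function-local workspace struct. */
typedef struct
{
    unsigned char poly_a[1024]; /* placeholder sizes and names */
    unsigned char poly_b[1024];
    unsigned char seed[64];
} example_keygen_ws;

/* The caller now allocates, checks, and frees a single example_keygen_ws object. */
```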
@hanno-becker force-pushed the alloc_free_config branch 3 times, most recently from 7840809 to 76a56c6 on January 1, 2026 20:18
@hanno-becker merged commit e379df7 into main Jan 2, 2026
779 checks passed
@hanno-becker deleted the alloc_free_config branch January 2, 2026 03:55
Labels: benchmark (this PR should be benchmarked in CI)