feat: POWER8 hardware AES via vcipher/vcipherlast (ISA 2.07) - 13x speedup #9932
Open
Scottcjn wants to merge 4 commits into wolfSSL:master
Uses ISA 2.07 crypto instructions (vcipher, vcipherlast, vncipher, vncipherlast, vsbox, vpmsumd) instead of scalar T-table approach. 8-way pipeline fills vcipher 7-cycle latency for parallelizable modes. Vectorized counter increment stays in registers (no memory round-trip).

Benchmarked on IBM POWER8 S824 (8286-42A):
- AES-128-CTR 8-way: 3,595 MiB/s (vs 262 MiB/s T-table = 13.7x)
- AES-128-CBC-dec 8-way: 2,796 MiB/s (vs 213 MiB/s = 13.2x)
- AES-128-ECB 8-way: 2,931 MiB/s (vs 265 MiB/s = 11.0x)
- AES-128-CBC-enc serial: 484 MiB/s (vs 267 MiB/s = 1.8x)

All correctness tests pass (CBC + CTR round-trips at 1MB).

Co-authored-by: OpenAI GPT-5.4 (vectorized counter increment, 8-way pipeline)
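The 8-way pipeline idea in the commit message can be illustrated in portable C. The sketch below is not the PR's code: `toy_round`, `process_serial`, and `process_8way` are hypothetical names, and `toy_round` is a placeholder mixing step, not AES. On POWER8 a vcipher-based round would presumably take its place. What it shows is only the loop structure: with the round loop outside the lane loop, the eight round operations in each step do not depend on each other, which is what lets a multi-cycle-latency instruction stay busy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8   /* number of independent blocks processed together */

/* Placeholder round on a 16-byte state. This is NOT AES; it only stands in
 * for a hardware round so the loop structure can be shown portably. */
static void toy_round(uint8_t *state, const uint8_t *round_key)
{
    for (int i = 0; i < 16; i++)
        state[i] = (uint8_t)((state[i] ^ round_key[i]) + (uint8_t)(i + 1));
}

/* Serial form: all rounds of one block finish before the next block starts,
 * so each round waits on the result of the previous round. */
static void process_serial(uint8_t *blocks, size_t nblocks,
                           const uint8_t *round_keys, int rounds)
{
    for (size_t b = 0; b < nblocks; b++)
        for (int r = 0; r < rounds; r++)
            toy_round(blocks + 16 * b, round_keys + 16 * r);
}

/* Interleaved form: for each round, the same round key is applied to eight
 * different blocks. The eight calls in the inner loop are independent. */
static void process_8way(uint8_t *blocks, size_t nblocks,
                         const uint8_t *round_keys, int rounds)
{
    assert(rounds > 0);
    size_t b = 0;
    for (; b + LANES <= nblocks; b += LANES)
        for (int r = 0; r < rounds; r++)
            for (int lane = 0; lane < LANES; lane++)
                toy_round(blocks + 16 * (b + lane), round_keys + 16 * r);

    /* Fewer than LANES blocks remain: finish them one at a time. */
    process_serial(blocks + 16 * b, nblocks - b, round_keys, rounds);
}
```

Both functions produce the same output for the same input; only the order of independent operations differs. This reordering applies to modes where blocks are independent (CTR, ECB, CBC decryption), which matches the serial CBC-enc figure in the list above.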
Can one of the admins verify this patch?
Wrap entire file in #if defined(__powerpc64__) so it compiles
cleanly on non-PPC targets (Apple M1, x86, ARM).
Move benchmark main() behind #ifdef POWER8_AES_BENCHMARK.
Add wolfSSL license header.
To build standalone benchmark:
gcc -mcpu=power8 -maltivec -mvsx -O3 -DPOWER8_AES_BENCHMARK \
-o power8_aes_bench ppc64-aes-power8-crypto.c -lrt
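As a rough sketch of the guard layout this commit describes (assumed structure, not the actual file contents), the nesting might look like the following. `power8_aes_compiled` and the `POWER8_AES_COMPILED` macro are hypothetical additions placed outside the guard so that the example has something callable on every target; the commit itself wraps the entire file.

```c
/* Everything POWER8-specific sits inside the architecture guard, so other
 * targets (x86, ARM) see none of it. */
#if defined(__powerpc64__)

#define POWER8_AES_COMPILED 1

/* ... POWER8 AES routines would live here ... */

/* The benchmark entry point is opt-in via -DPOWER8_AES_BENCHMARK. */
#ifdef POWER8_AES_BENCHMARK
#include <stdio.h>
int main(void)
{
    printf("POWER8 AES benchmark build\n");
    return 0;
}
#endif /* POWER8_AES_BENCHMARK */

#else

#define POWER8_AES_COMPILED 0

#endif /* __powerpc64__ */

/* Hypothetical helper (not from the PR): reports whether the guarded code
 * was compiled into this translation unit. */
int power8_aes_compiled(void)
{
    return POWER8_AES_COMPILED;
}
```

Without `-DPOWER8_AES_BENCHMARK` the file contributes no `main`, so it can be linked into a library build.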
Hi @Scottcjn, we would be thrilled to have these code changes but need a contributor agreement. Thanks,
Summary
This PR adds POWER8 hardware-accelerated AES using the ISA 2.07 vector crypto instructions (vcipher, vcipherlast, vncipher, vncipherlast, vsbox, vpmsumd). These instructions have been available since POWER8 (2013) and provide AES round operations with single-cycle throughput.

The current PPC64 ASM approach in PR #9852 uses scalar T-table AES with GPR instructions, which is significantly slower than using the hardware crypto unit.
Key Optimizations
- vcipher/vcipherlast: single-cycle AES round in the vector crypto unit
- 8-way pipeline: fills the vcipher latency gap (1 cycle throughput × 8 chains = full pipeline utilization)
- Vectorized counter increment: vec_add stays in registers, which eliminates the store-load round-trip in CTR mode

Benchmark Results: IBM POWER8 S824 (8286-42A)
Hardware: Dual 8-core POWER8, 512GB RAM, Ubuntu 20.04, GCC 9.4.0
Build:
gcc -mcpu=power8 -maltivec -mvsx -O3 -mtune=power8 -funroll-loops

vs PR #9852 T-table (best configuration: NO_HARDEN, -O3, aesgcm=table)
Full results by key size
Why hardware crypto instead of T-tables?
- vcipher/vcipherlast have been available since POWER8 (ISA 2.07, 2013), which covers all 64-bit Power Systems in active use
- __builtin_crypto_* intrinsics: no inline assembly, no Ruby code generators

Additional finding: GMAC correctness bug in PR #9852
During testing of PR #9852 on POWER8, the testwolfcrypt GMAC test fails with error L=18271 when PPC64 ASM is enabled (both hardened and unhardened, both GCM table modes). All tests pass without PPC64 ASM.

Integration status
This PR provides the standalone implementation with benchmark harness. Full wolfSSL build system integration (configure.ac, aes.c dispatch, CPUID detection) can follow as a subsequent PR; wanted to get the core implementation and performance data out for review first.

Test plan
cc @SparkiDev — benchmarked alongside your PR #9852 on real POWER8 hardware. The vcipher instruction set is the key differentiator. Happy to collaborate on getting hardware crypto integrated.
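For the CPU detection step listed under Integration status, one possible shape on Linux is a runtime check of the auxiliary vector. This is a sketch under assumptions, not part of the PR: `power8_crypto_available` is a hypothetical name, the check is Linux-specific, and the feature-bit value is the one published in the Linux powerpc headers (defined here only if the system headers do not already provide it under this name). On other targets the function simply reports false.

```c
#include <stdbool.h>

#if defined(__powerpc64__) && defined(__linux__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP2 */
#ifndef PPC_FEATURE2_VEC_CRYPTO
/* Vector crypto feature bit as listed in the Linux powerpc cputable.h. */
#define PPC_FEATURE2_VEC_CRYPTO 0x02000000
#endif
#endif

/* Hypothetical helper: true when the kernel reports the ISA 2.07 vector
 * crypto facility, false on any other platform. */
static bool power8_crypto_available(void)
{
#if defined(__powerpc64__) && defined(__linux__)
    return (getauxval(AT_HWCAP2) & PPC_FEATURE2_VEC_CRYPTO) != 0;
#else
    return false;
#endif
}
```

A dispatcher in aes.c could call such a check once at initialization and fall back to the existing software path when it returns false. How wolfSSL would actually wire this in is left open by the PR.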