Potential speed improvements for SHA512 via BMI2 instructions #640

Open
invd opened this issue Jan 25, 2025 · 0 comments

I recently looked into the sha2 crate performance, specifically for performing many consecutive SHA512 calculations on modern x64 processors which do not yet have the brand-new SHA512 instructions mentioned in #634.

As documented in RustCrypto/asm-hashes#83 and RustCrypto/asm-hashes#82, the now-deprecated asm feature of sha2 0.10.x is slower than the AVX2-enabled native Rust implementation based on intrinsics. Upon closer inspection, this makes sense, since the chosen asm code doesn't use AVX or other newer CPU technologies at all.

Other implementations, such as libgcrypt's, ship specially optimized asm code like sha512-avx2-bmi2-amd64.S. In quick benchmarks, that code is roughly 25% faster for SHA512 than the sha2 crate.

  • Tested on AMD Zen3 Ryzen 5950X under Linux
  • RUSTFLAGS='-C target-cpu=native' cargo +nightly bench -p sha2 reports test sha512_10000 [...] = 894 MB/s
  • libgcrypt tests/bench-slope --repetitions 10000 reports 1084 MiB/s
  • The benchmark harnesses may not be fully comparable and report different units (MB/s vs. MiB/s). This is just some quick testing to get the relevant ballpark numbers (!)
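
To put the two figures above on the same scale, here is a small sketch of the unit conversion. It assumes the cargo bench figure is in decimal MB/s (10^6 bytes) and the bench-slope figure is in binary MiB/s (2^20 bytes); the helper names are made up for illustration. Under those assumptions, 1084 MiB/s is about 1137 MB/s, which is about 27% above 894 MB/s.

```rust
/// Convert a throughput in MiB/s (2^20 bytes) to MB/s (10^6 bytes).
/// Hypothetical helper, not part of either benchmark harness.
fn mib_to_mb_per_s(mib_per_s: f64) -> f64 {
    mib_per_s * 1_048_576.0 / 1_000_000.0
}

/// Relative speedup of `fast` over `slow`, in percent.
fn speedup_percent(fast: f64, slow: f64) -> f64 {
    (fast / slow - 1.0) * 100.0
}

fn main() {
    let sha2_mb = 894.0; // cargo bench result, assumed decimal MB/s
    let gcrypt_mb = mib_to_mb_per_s(1084.0); // bench-slope result, MiB/s

    println!("libgcrypt: {:.0} MB/s", gcrypt_mb);
    println!("speedup over sha2: {:.0}%", speedup_percent(gcrypt_mb, sha2_mb));
}
```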

Another well-known project with this optimization level is the Linux kernel, see arch/x86/crypto/sha512-avx2-asm.S.

Based on observations made as part of RustCrypto/asm-hashes#83, a potential explanation is that the current optimized Rust code in sha2/src/sha512/x86_avx2.rs uses AVX2 but not BMI2. For the assembler implementations, the BMI2 instruction RORX made a significant performance difference. The terminology is a bit fuzzy here: since BMI2 seems to be present on all common processors that have AVX2, it's sometimes mentioned as belonging to AVX2, but it is technically a separate extension, see Wikipedia.
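
For context on why a rotate instruction matters here: the SHA-512 round function computes two "big sigma" values, each the XOR of three right-rotations of the same 64-bit word. RORX writes its result to a separate destination register and leaves the flags untouched, so the three rotations need no extra register copies. A minimal sketch in portable Rust follows; the function names are mine, not taken from the sha2 source.

```rust
/// SHA-512 big sigma 0: XOR of three right-rotations of the same word.
fn big_sigma0(x: u64) -> u64 {
    x.rotate_right(28) ^ x.rotate_right(34) ^ x.rotate_right(39)
}

/// SHA-512 big sigma 1, with the rotation amounts from the SHA-512 spec.
fn big_sigma1(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}

fn main() {
    let x = 0x0123_4567_89ab_cdef_u64;
    println!("sigma0 = {:016x}", big_sigma0(x));
    println!("sigma1 = {:016x}", big_sigma1(x));
}
```

Whether the compiler actually lowers these rotations to RORX depends on the enabled target features and on LLVM's instruction selection, so the generated assembly would need to be checked.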

The bmi2 target feature has been around for a while, since rust-lang/rust#30462. I'm not an expert on Rust intrinsics, but the RORX instruction seems to be missing from the intrinsics that core::arch::x86_64 currently implements in core_arch/src/x86_64/bmi2.rs?
If the instruction itself isn't available, that may be a major roadblock to using it in sha2 for SHA512. I'm not sure of the exact backstory here, but gnzlbg/bitintr#2 seems to hint that RORX and similar instructions have been unavailable since 2017, so it doesn't look like a regression.
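
One possible workaround, sketched here as an assumption rather than a tested approach: even without a dedicated intrinsic, a function can be compiled with the bmi2 target feature enabled and selected at runtime, which at least permits the compiler to choose RORX for constant rotations. Whether it does so is up to LLVM. The function names below are hypothetical and not part of sha2.

```rust
/// Portable fallback for SHA-512 big sigma 1.
fn sigma1_portable(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}

/// Same computation, compiled with BMI2 enabled so that the backend is
/// allowed (not guaranteed) to select RORX for the constant rotations.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn sigma1_bmi2(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}

/// Runtime dispatch: use the BMI2 build only if the CPU reports support.
fn sigma1(x: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was verified at runtime just above.
            return unsafe { sigma1_bmi2(x) };
        }
    }
    sigma1_portable(x)
}

fn main() {
    let x = 0x0123_4567_89ab_cdef_u64;
    println!("sigma1 = {:016x}", sigma1(x));
}
```

In a real hashing loop the feature check would be done once per call to the compression function rather than per rotation, since a `#[target_feature]` function cannot be force-inlined into callers that lack the feature.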

To summarize, I suspect that once this particular BMI2 instruction is supported, it may be possible to squeeze additional SHA512 performance out of existing CPUs.
Notably, this does not rely on the more recent AVX512 instruction set or VSHA512 instruction set. It also probably won't be relevant for SHA1/SHA256, where faster mechanisms are commonly available and already used by sha2 on most modern CPUs.
