Potential speed improvements for SHA512 via BMI2 instructions #640

invd · 2025-01-25T15:03:53Z

I recently looked into the sha2 crate performance, specifically for performing many consecutive SHA512 calculations on modern x64 processors which do not yet have the brand-new SHA512 instructions mentioned in #634.

As documented in RustCrypto/asm-hashes#83 and RustCrypto/asm-hashes#82, the now-deprecated asm feature of sha2 0.10.x is slower than the native Rust implementation that uses AVX2 intrinsics. On closer inspection, this makes sense, since the chosen asm code doesn't use AVX or other newer CPU extensions at all.

Other implementations such as libgcrypt ship specially optimized asm code like sha512-avx2-bmi2-amd64.S. In quick benchmarks, libgcrypt is roughly 25% faster for SHA512 than the sha2 crate.

Tested on AMD Zen3 Ryzen 5950X under Linux:
RUSTFLAGS='-C target-cpu=native' cargo +nightly bench -p sha2 reports test sha512_10000 [...] = 894 MB/s
libgcrypt tests/bench-slope --repetitions 10000 reports 1084 MiB/s
The benchmark harnesses may not be fully comparable and report different units (MB/s vs. MiB/s); this is just some quick testing to get the relevant ballpark numbers (!)
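Since the two results use different units, a small conversion helps when comparing them. The sketch below assumes that MB means 10^6 bytes and MiB means 2^20 bytes; the function names are made up for illustration, and the harness differences noted above still apply.

```rust
// Assumption: "MB" is 10^6 bytes and "MiB" is 2^20 bytes.
const BYTES_PER_MB: f64 = 1_000_000.0;
const BYTES_PER_MIB: f64 = 1_048_576.0;

/// Convert a throughput in MiB/s to MB/s (hypothetical helper).
fn mib_per_s_to_mb_per_s(mib_per_s: f64) -> f64 {
    mib_per_s * BYTES_PER_MIB / BYTES_PER_MB
}

/// How much faster `other` is than `baseline`, in percent.
/// Both inputs must use the same unit.
fn speedup_percent(baseline: f64, other: f64) -> f64 {
    (other / baseline - 1.0) * 100.0
}
```

Under these assumptions, `mib_per_s_to_mb_per_s(1084.0)` is about 1137 MB/s, and `speedup_percent(894.0, 1136.656384)` is roughly 27, which is in the same ballpark as the figure quoted above.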

Another well-known project with this optimization level is the Linux kernel, see arch/x86/crypto/sha512-avx2-asm.S.

Based on observations made as part of RustCrypto/asm-hashes#83, a potential explanation is that the current optimized Rust code in sha2/src/sha512/x86_avx2.rs uses AVX2, but not BMI2. For the assembler implementations, the BMI2 instruction RORX made a significant performance difference. The terminology is a bit fuzzy here: since BMI2 seems to be present on all common processors that have AVX2, it's sometimes described as part of AVX2, but it is technically a separate extension, see Wikipedia.
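For context on where RORX would matter: the SHA-512 round function is built from fixed-distance 64-bit rotations (the "big sigma" functions from FIPS 180-4). The sketch below is a plain scalar illustration, not the code in x86_avx2.rs, and the function names are chosen for this example. RORX is a rotate that writes to a separate destination register and leaves the flags untouched. Whether a compiler actually lowers `rotate_right` to RORX when BMI2 is enabled is an assumption that would need to be checked in the generated assembly.

```rust
/// SHA-512 "big sigma 0" (FIPS 180-4): three rotations XORed together.
#[inline(always)]
pub fn big_sigma0(x: u64) -> u64 {
    x.rotate_right(28) ^ x.rotate_right(34) ^ x.rotate_right(39)
}

/// SHA-512 "big sigma 1" (FIPS 180-4).
#[inline(always)]
pub fn big_sigma1(x: u64) -> u64 {
    // Each rotation needs the unmodified `x` again, so a non-destructive
    // rotate (such as RORX) can avoid extra register copies here.
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}
```

Each round evaluates both functions once, so 80 rounds per 128-byte block add up to several hundred rotations per block.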

The bmi2 target feature has been around for a while, since rust-lang/rust#30462. I'm not an expert on Rust intrinsics, but RORX seems to be missing from the intrinsics currently implemented in core_arch/src/x86_64/bmi2.rs and exposed via core::arch::x86_64?
If the instruction itself isn't available, that may be a major roadblock to using it in sha2 for SHA512. I'm not sure of the exact backstory here, but gnzlbg/bitintr#2 seems to hint that RORX and similar instructions have been unavailable since 2017, so it doesn't look like a regression.
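If there is indeed no explicit intrinsic, one possible route (a sketch under assumptions, not something the sha2 crate is known to do) is to compile the rotation-heavy code with the bmi2 target feature enabled and dispatch to it at runtime. The function names below are hypothetical. `#[target_feature]` and `is_x86_feature_detected!` are standard Rust, but whether the compiler then emits RORX for the rotations is not guaranteed and would have to be verified in the disassembly.

```rust
/// Portable fallback: SHA-512 "big sigma 1" from FIPS 180-4.
fn sigma1_portable(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}

/// Same computation, compiled with BMI2 enabled for this function only.
/// Assumption: the code generator may pick RORX for these rotations.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
unsafe fn sigma1_bmi2(x: u64) -> u64 {
    x.rotate_right(14) ^ x.rotate_right(18) ^ x.rotate_right(41)
}

/// Picks the BMI2 build when the running CPU supports it.
pub fn sigma1_dispatch(x: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was verified at runtime just above.
            return unsafe { sigma1_bmi2(x) };
        }
    }
    sigma1_portable(x)
}
```

In a real implementation the feature check would sit around the whole block-compression function rather than a single sigma call, so that the detection cost and the call boundary do not land inside the hot loop.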

To summarize, I suspect that once there is support for this particular BMI2 CPU instruction, it may be possible to squeeze additional SHA512 performance out of existing CPUs.
Notably, this does not rely on the more recent AVX512 or VSHA512 instruction sets. It also probably won't be relevant for SHA1/SHA256, where faster mechanisms are commonly available and already used by sha2 on most modern CPUs.
