I recently looked into the performance of the `sha2` crate, specifically for computing many consecutive SHA512 hashes on modern x64 processors that do not yet have the brand-new SHA512 instructions mentioned in #634.

As documented in RustCrypto/asm-hashes#83 and RustCrypto/asm-hashes#82, the now-deprecated `asm` feature of `sha2` `0.10.x` is slower than the AVX2-enabled native Rust implementation that uses intrinsics. On closer inspection this makes sense, since the chosen `asm` code doesn't use AVX or other newer CPU technologies at all.

Other implementations, such as `libgcrypt`'s, ship specially optimized `asm` code like `sha512-avx2-bmi2-amd64.S`, and in quick benchmarks they are roughly 25% faster for SHA512 than the `sha2` crate.

Tested on AMD Zen3 Ryzen 5950X under Linux:
- `RUSTFLAGS='-C target-cpu=native' cargo +nightly bench -p sha2` has `test sha512_10000 [...] =` 894 MB/s
- `libgcrypt` `tests/bench-slope --repetitions 10000` shows 1084 MiB/s

Another well-known project with this optimization level is the Linux kernel, see arch/x86/crypto/sha512-avx2-asm.S.
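As a quick sanity check on the two throughput figures, the comparison can be sketched in a few lines. This assumes that the `MB/s` printed by `cargo bench` means decimal megabytes and that `MiB/s` from `bench-slope` means binary mebibytes; I have not double-checked either tool's definition, and the helper names below are made up for illustration.

```rust
/// Converts a throughput in MiB/s (2^20 bytes) to MB/s (10^6 bytes).
/// Assumption: bench-slope reports binary mebibytes.
pub fn mib_to_mb(mib_per_s: f64) -> f64 {
    mib_per_s * 1_048_576.0 / 1_000_000.0
}

/// Relative speedup of `other` over `baseline`, in percent.
/// Both arguments must use the same unit.
pub fn speedup_percent(baseline: f64, other: f64) -> f64 {
    (other / baseline - 1.0) * 100.0
}

/// Returns (speedup with unit conversion, speedup taking both numbers at face value).
pub fn compare_sha512_results() -> (f64, f64) {
    let sha2_mb_s = 894.0; // cargo bench result from above
    let libgcrypt_mib_s = 1084.0; // bench-slope result from above
    let converted = speedup_percent(sha2_mb_s, mib_to_mb(libgcrypt_mib_s));
    let face_value = speedup_percent(sha2_mb_s, libgcrypt_mib_s);
    (converted, face_value)
}
```

Under these assumptions the gap comes out at about 27% with the unit conversion and about 21% without it, which brackets the "roughly 25%" estimate.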
Based on observations made as part of RustCrypto/asm-hashes#83, a potential explanation is that the current optimized native Rust code in `sha2/src/sha512/x86_avx2.rs` uses `AVX2`, but not `BMI2`. For the assembler implementations, the `BMI2` instruction `RORX` made a significant performance difference. The terminology is also a bit fuzzy here: since BMI2 seems to be present on all common processors that have AVX2, it is sometimes described as part of AVX2, but it is technically a separate extension, see Wikipedia.

The `bmi2` target feature has been around for a while, since rust-lang/rust#30462. I'm not an expert on Rust intrinsics, but the `RORX` instruction seems to be missing from the intrinsics in the current core_arch/src/x86_64/bmi2.rs that are exposed through `core::arch::x86_64`?

If the instruction itself isn't available, that may be a major roadblock to using it in `sha2` for SHA512. I'm not sure of the exact backstory here, but gnzlbg/bitintr#2 seems to hint that `RORX` and similar instructions have been unavailable since 2017, so it doesn't look like a regression.

To summarize, I suspect that once there is support for this particular BMI2 CPU instruction, it may be possible to squeeze additional SHA512 performance out of existing CPUs.
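To make the idea more concrete, here is a minimal sketch of two ways one might get at `RORX` without a dedicated intrinsic, using the SHA-512 Σ1 function (rotations by 14, 18 and 41) as the example. All function names are hypothetical and none of this is taken from `sha2`. Variant 1 only enables `bmi2` for a function and hopes the compiler picks `RORX` for `rotate_right`; whether it actually does is an assumption that would need to be checked in the disassembly. Variant 2 forces the instruction through inline assembly.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Portable SHA-512 Σ1: ROTR(e, 14) ^ ROTR(e, 18) ^ ROTR(e, 41).
#[inline(always)]
pub fn big_sigma1(e: u64) -> u64 {
    e.rotate_right(14) ^ e.rotate_right(18) ^ e.rotate_right(41)
}

/// Variant 1: the same Rust code, compiled with BMI2 enabled for this function.
/// The compiler is then allowed (but not required) to use RORX for the rotates.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
pub unsafe fn big_sigma1_bmi2(e: u64) -> u64 {
    big_sigma1(e)
}

/// Variant 2: force RORX via inline assembly.
/// RORX writes to a separate destination register and leaves the flags untouched.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi2")]
pub unsafe fn big_sigma1_rorx_asm(e: u64) -> u64 {
    let (r14, r18, r41): (u64, u64, u64);
    unsafe {
        asm!(
            "rorx {r14}, {e}, 14",
            "rorx {r18}, {e}, 18",
            "rorx {r41}, {e}, 41",
            e = in(reg) e,
            // `out` (not `lateout`) keeps the outputs in registers distinct
            // from the input, which is read again after the first write.
            r14 = out(reg) r14,
            r18 = out(reg) r18,
            r41 = out(reg) r41,
            options(pure, nomem, nostack, preserves_flags),
        );
    }
    r14 ^ r18 ^ r41
}

/// Picks the RORX variant when the CPU reports BMI2, else the portable code.
pub fn big_sigma1_dispatch(e: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("bmi2") {
            // SAFETY: BMI2 support was confirmed at runtime just above.
            return unsafe { big_sigma1_rorx_asm(e) };
        }
    }
    big_sigma1(e)
}
```

A real backend would of course not dispatch per rotation but once per block or per call, and the inline assembly variant may well prevent the compiler from scheduling the rounds as freely as it can with plain Rust, so this is only meant to show that the instruction is reachable, not that this shape is fast.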
Notably, this does not rely on the more recent AVX512 instruction set or the `VSHA512` instruction set. It also probably won't be relevant for SHA1/SHA256, where faster mechanisms are commonly available and already used by `sha2` on most modern CPUs.
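Because BMI2 is technically separate from AVX2, a code path that depends on `RORX` would presumably have to check for both features at runtime, while needing neither AVX512 nor `VSHA512`. A minimal sketch of such a check with the standard library's `is_x86_feature_detected!` macro follows; the function name and the returned labels are hypothetical, and the detection mechanism actually used by `sha2` may differ.

```rust
/// Reports which (hypothetical) SHA-512 code path could be selected on this CPU.
/// AVX2 and BMI2 are queried separately because they are distinct CPU features.
pub fn sha512_backend_name() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        let avx2 = std::is_x86_feature_detected!("avx2");
        let bmi2 = std::is_x86_feature_detected!("bmi2");
        if avx2 && bmi2 {
            return "avx2+bmi2";
        }
        if avx2 {
            return "avx2";
        }
    }
    // Fallback for non-x86 targets and for CPUs without AVX2.
    "portable"
}
```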