-
-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add gxhash #279
base: master
Are you sure you want to change the base?
Add gxhash #279
Conversation
I'll check what we can do. Esp. The SIMD width discrepancy is unfortunate |
We can stick to 128-bit width SIMD intrinsics on X86, but at the cost of a lower throughput and I think it's best in terms of usability to hide this to the user (have the algorithm pick whatever instruction set is available to reach the maximum throughput it can)
Here are some raw numbers for this platform:
|
For now I see two options: Option 1: Make
|
After the latest tweaks I was able to squeeze out a little more performance while still passing all tests. It would be difficult for me to make it any faster without affecting quality. I think I'll stick with that implementation.
Now the question is whether I should opt for option 1 (like t1ha) or option 2 (and how to fix the CI 😅) |
This reverts commit 0e4ce59.
I just pushed one more optimization (still passing the SMHasher tests ofc). I also took the opportunity to run https://github.com/tkaitchuck/aHash since it was advertised as "the fastest, DOS-resistant hash currently available in Rust". Here are the results, still running on my Macbook M1 pro (ARM64):
Here are the numbers I got for the same 128-bit state version of the algorithm but on x86 (Ryzen 5).
The 256-bit state version should be even faster, but my PC does not support VAES instructions, so I need to find another beast to experiment it. |
Hi @rurban, while finalizing my work on gxHash64 I noticed very different timings from my benchmark setup compared to the SMHasher speed test. For instance, I get worse performance for FNV1a for small sizes until inputs of size 128 bytes, and then throughput degrades as size continues to increase. From all FNV1a benchmarks I've seen FNV1a 64 performs the best for inputs for size 8 and then throughput decreases as size increases.
I think this is because of the loop overhead itself in the
Benchmarking method that execute in a very small number of cycles seems quite difficult. Unrolling may prevent inlining at some point but I'm guessing there are other ways, like substracting the loop overhead to the timing (getting the overhead with an empty warmup of by estimating it, I'm not sure what is the most accurate way) What do you think? Note: This is ran on an a Macbook so aarch64 |
I have been working on a non-cryptographic hash algorithm with performance in mind. The idea is to leverage modern hardware capabilities as much as possible for maximum throughput (SIMD instrinsics, ILP, and some tricks).
For the story, I don't consider myself an expert in cryptography, but I simply fell down the rabbit hole... At the beginning, I had as an objective to improve a simple hash algorithm in C#. Limited by the possibilities offered by C#, but teased by the possibilities I envisioned on the way, I rewrote it in Rust. Optimization after optimization, the algorithm started to outperform many of its counterparts, and so at this point I decided to give it a name and to write a paper on it (it's still a rough draft!).
The algorithm is named GxHash and has the following features:
The algorithm isn't stabilized yet: the version I have reimplemented in C and included in this PR is a variation with a more robust
compress
implementation. With this tweak, the algorithm seems to pass all SMHasher tests, but at the cost of performance. Still, on my laptop (Macbook M1 pro) it seems like it still outperforms all of its counterparts. Possibly it can be made as robust while not giving off as much performance, this is WIP.One important thing about this algorithm also is that in its current form, it will generate different hashes on two machines with different SIMD register width. So it's best when used "in-process", for hashtables for instance.
Regarding SMHasher and this PR, I had to cheat a bit to be able to build on my ARM Macbook, what's the recommended way to build / testall on this platform ? @rurban
Any feedback is welcome on the PR and even the algorithm itself, I did all of this by myself, and so I may have missed something!