Implement an AVX2 accelerated Engine #170

dequbed · 2021-09-15T22:24:38Z

What is this about?

This PR adds a new Engine implementation based on the Wojciech Muła & Daniel Lemire 2017, that makes use of the AVX2 extensions on modern x86 and AMD64/x86_64 CPUs.
It's currently on the order of 200-300% faster than the fast_portable Engine on Intel Skylake and AMD Zen 2. However I've not done any tuning on the engine yet so this is most likely only the lower bound.

What's the catch?

The algorithm makes some very specific assumptions about the alphabet used so it's not possible to configure it to any alphabet outside of STANDARD. (URL_SAFE is also possible but I haven't yet added support on the decoding side).
Because it uses intrinsics the code is unsafe. Because it also has to offset pointers into potentially unallocated space the code is violently unsafe.

Since this crate is marked with #[forbid(unsafe_code)] by default and I found it not justifiable to add new unsafe code to a fundamental crate like this all unsafe code is explicit opt-in. The engine is only enabled with the feature avx2, which in turn requires the feature unsafe (which allows unsafe code).

So, what now?

Right now the code is not in a stage where I'd consider it mergeable, there's a number of TODO's left in the code, at least one FIXME and I'd like to prove the soundness of some parts first. It does however work and passes all of the tests that DEFAULT_ENGINE passes.
But, I'd like to get the discussion with you started on this PR sooner rather than later, especially given that it adds unsafe code and plays with pointers. I would like to see this crate implement intrinsics-accelerated and more generally SIMD enabled Engines but I can also understand why'd you would like to not have any unsafe code at all.

Featurewise there's two more straightforward additions I'm planning for the PR should we decide on adding this engine into this repo at all:

A "literate" extension to the engine, i.e. making it configureable for any alphabet using the fast algorithm where possible and vectorized LUTs otherwise. Using LUTs even with AVX2 won't be as fast as the optimized algorithm, but it should still provide a significant speed boost over a non-SIMD implementation.
Actually put in the efford to tune the code. There's a lot of potential improvements to be had that I haven't had the time to do.

In general it should also be possible to get a similar kind of improvement for other platforms, using e.g. ARM Neon, IBM VSX or the WebAssembly SIMD extension, but I haven't looked into any of those yet so I can't say for certain how much transfers.

Okay, one last question: … But why?

I couldn't sleep Sunday to Monday and writing code that makes CPUs go very fast, hot, and bothered is apparently my way of dealing with that :)

marshallpierce · 2021-09-16T00:22:35Z

This sort of thing is exactly what I made Engine to support; bravo. I think putting the opt-in in the user's hands for unsafe is a reasonable path. It could also be put in a separate crate but that seems like perhaps more hassle than it's worth.

I'm terribly busy so who knows when I'll have time to give this a close look, but it's definitely something I want to support.

marshallpierce · 2021-09-16T00:23:40Z

Oh, and this might be the sort of thing that would be worth doing #98 for.

dequbed · 2021-09-16T18:17:10Z

So, with 7039695 I have a first version I'd be okay with being merged¹. I finished implementing the url-safe alphabet, and I am reasonable certain that the code is correct in the sense that it's well behaved to memory, in the sense that it produces correct input data and output data, and in the sense that it's decent enough looking and well behaved enough in company that I wouldn't be ashamed to be seen at it's side on a party.

Additionally I skewed the ratio of lines-of-indecipherable-simd-intrinsics to comments-explaining-the-vodoo somewhat in my favour. That is, future me's favour, for when I'm sitting in front of it again in 5 years time and have to figure our what it's doing and why it has stopped working.

¹: I'd sincerely prefer to first squash the history into something more reasonable though.

marshallpierce · 2021-09-23T12:50:45Z

Thanks. It'll take me at least a couple weeks to get to this but I will get there!

dequbed · 2021-09-23T20:44:19Z

Thanks. It'll take me at least a couple weeks to get to this but I will get there!

No worries, the world has made it this far without AVX2 accelerated base64 in Rust, I think it'll limp along at least a few more weeks :)

I'm also still working on this PR here and there. The comments aren't up to the level of neither quality nor quantity I'd like them to be on, and the entire thing is plain and simple not fast enough yet.
Additionaly; fuzzing. You are right that getting this crate on oss-fuzz would be a smart move. Whatever comes from that will be a separate PR however since fuzzing is not about AVX2 and AVX2 is not about fuzzing.

So no worries and take your time. Neither the code nor me is going anywhere fast and the worst that could happen is that you'll find a more polished, more refined PR that what's here now. ^^

The paper this algorithm is taken from can be found via DOI 10.1145/3132709 arXiv:1704.00605 or via arXiv.org: https://arxiv.org/abs/1704.00605

This adds the AVX2Encoder engine to the existing benchmarks and tests and introduces three new fuzzing targets: - roundtrip_avx - decode_random_avx which behave very similar to the non-avx versions, and - roundtrip_avx_equivalent Which checks that the avx2 engine and the DEFAULT_ENGINE produce the same data both ways.

dequbed · 2022-01-19T16:36:03Z

Hello!

I recently saw that you were preparing for a new release of base64 and that got me motivated to clean up the PR a bit.
So, the PR is now rebased onto master and I squashed it down into three commits:
One adding the actual engine code, one importing it into the module tree and a last one adding integration tests, benchmarks and fuzzing targets. This way if something breaks you can just revert a single commit that did the breaking.

No worries if you don't have time to review the PR just yet, I haven't found the time to really sit down and tune the engine either.
This is really just picking up the PR, wiping of the dust and putting it back right the way it was before.

EDIT: Okay it's ~~four~~ five! commits now, ~~I didn't cargo fmt the code and that made CircleCI unhappy ^^'~~ but CircleCI is happy and a happy CI a happy dequbed makes :P

If the CircleCI runner allows us to use AVX2 extensions this will use them readily, if it doesn't the fuzz_target! will compile but do absolutely nothing.

marshallpierce · 2022-01-24T21:30:16Z

Thanks for maintaining this. I will get to it eventually... tough to find a big block of uninterrupted, unallocated time.

dequbed · 2022-01-25T01:30:59Z

Hey no worries! ^^

Not having AVX2-accelerated base64 in Rust isn't going to make the world go up in flames.
Really, the worst thing that could happen if this sits for too long is that I send in a few more PRs to garner your trust, become co-maintainer, and merge it myself! :P

dequbed · 2022-01-25T16:52:55Z

I wanted to wait discussing that until later, but I think it's a good idea to finalize this PR into an integrated package instead of a more fleshed out PoC, so:

How do we want to handle engine selection?

I think an user enabling an avx2 or other SIMD feature would like to use the fastest engine available, so the current situation of DEFAULT_ENGINE and thus also base64::{encode, decode} not making use of hardware-accelerated engines is in my opinion more surprising that the alternative.

However, that only works for users that also compile locally and with -C target-cpu=native which e.g. excludes users of binary distributions.
To allow those users to use fast engines runtime detection would have to be added.

Rust brings the macro std::is_x86_feature_detected (and similar for other architectures) so this could be implemented without adding dependencies.

I would like to see runtime detection for the shorthand methods like base64::encode making them use the fastest engine available in the runtime environment. Of course the detection using the macro wouldn't work in no_std environments but I think those want more explicit control and would prefer to use encode_engine anyway.

If you think that would be a sensible addition I can either add it to this PR or open a new one only adding the detection system that could be more easily merged before an upcoming 0.20 release, whichever you prefer.

marshallpierce · 2022-02-08T15:51:08Z

To make it easier to review in my fragments of OSS time, let's do the feature detection separately.

I'm of two minds about the nature of DEFAULT_ENGINE. Default might mean "Always a safe choice", and if you're using Rust because you prioritize safety, that's probably an assumption that a lot of people have. For them, the default being a unsafe-free engine makes sense. OTOH, your AVX2-using, bandwidth-saturating, chip-melting performance enthusiast has different priorities.

Along those lines, what if we expressly catered to the two (or more) priorities by steering users accordingly? Maybe we should have a FASTEST_ENGINE that resolves to the fastest impl supported on the hardware in that moment, and a ???_ENGINE (with some name that is suitably attractive without being scary) that works everywhere (and is unsafe-less), and then users can pick according to their priorities without worrying about the details too much.

marshallpierce · 2022-02-08T15:52:47Z

For that matter I don't like the engine name FastPortable either; I just haven't found something that appropriately characterizes "works on all CPUs and is moderately optimized without being unsafe or otherwise adventurous".

dequbed · 2022-02-23T14:35:48Z

To make it easier to review in my fragments of OSS time, let's do the feature detection separately.

Sounds good. Do you want to have that PR available now or should I wait working on that after we have merged this one?

I'm of two minds about the nature of DEFAULT_ENGINE. Default might mean "Always a safe choice", and if you're using Rust because you prioritize safety, that's probably an assumption that a lot of people have. For them, the default being a unsafe-free engine makes sense. OTOH, your AVX2-using, bandwidth-saturating, chip-melting performance enthusiast has different priorities.

I agree. Rust caters to many different groups, especially to both performance-sensitive and safety-sensitive at the same time. And I think this crate is fundamental enough that we should keep all of those groups in our heads.

Along those lines, what if we expressly catered to the two (or more) priorities by steering users accordingly? Maybe we should have a FASTEST_ENGINE that resolves to the fastest impl supported on the hardware in that moment, and a ???_ENGINE (with some name that is suitably attractive without being scary) that works everywhere (and is unsafe-less), and then users can pick according to their priorities without worrying about the details too much.
For that matter I don't like the engine name FastPortable either; I just haven't found something that appropriately characterizes "works on all CPUs and is moderately optimized without being unsafe or otherwise adventurous".

I think we want to explicitly use a slow engine so at least the "Fast" in FastPortable is redundant ;)
So maybe have a FASTEST_ENGINE and a PORTABLE_ENGINE with docs explaining that the former is the fastest available engine for the feature/compilation options/architecture which may include SIMD-accelerated ones and the latter one is code that's wholly architecture-independent and thus should behave the same way on all platforms and architectures.

Especially with portable-simd the "fastest engine" could be written without any unsafe but be very much unportable and lead to accidental undefined behaviour, e.g. when an application is compiled with target-cpu=native but run on a CPU without AVX2 or NEON available — FASTEST_ENGINE would lead to undefined behaviour, SIGILL and other "fun" errors, PORTABLE_ENGINE always* works. In that way, using FASTEST_ENGINE is an explicit opt-in to take extra care that the target platform is defined correctly, independent of any features (which you may not fully control since feature unification is a thing!)

*Terms and conditions apply :P

marshallpierce · 2022-03-01T12:03:21Z

Let's hold off on feature detection for now so I don't build up too embarrassingly large a PR backlog.

I think a "portable engine" and "fastest available" engine makes sense, though it does have a wrinkle: unlike the portable engine, which is a known type and could be used directly by the developer, the fastest engine would have to be constructed via an intervening function to handle resolving the correct CPU features, etc. To make constructing an unknown engine viable, should we force engines to be constructable from an alphabet and a config? If so, then the config type would need to be shared across engines. That's simpler for users, but perhaps harder for engines. Did supporting optional padding or lenient encoding feel like an undue burden?

dyc3 · 2023-07-10T15:30:33Z

src/engine/mod.rs

@@ -4,6 +4,13 @@ use crate::{alphabet, DecodeError};

 pub mod fast_portable;

+#[cfg(all(
+    any(target_arch = "x86", target_arch = "x86_64"),
+    target_feature = "avx2",


Specifying a strict target_feature here might prevent library users from doing dynamic cpu feature detection.

It would probably be better to tell the compiler that it's OK to generate avx2 instructions here using:

#[target_feature(enable = "avx2")]

dequbed marked this pull request as ready for review September 16, 2021 18:17

dequbed added 3 commits January 19, 2022 17:01

Implement an AVX2 alg based on Wojciech Muła & Daniel Lemire 2017

3caf7a8

The paper this algorithm is taken from can be found via DOI 10.1145/3132709 arXiv:1704.00605 or via arXiv.org: https://arxiv.org/abs/1704.00605

Import the AVX2 engine into the module tree

1371051

dequbed force-pushed the avx2 branch from 7039695 to 22163c0 Compare January 19, 2022 16:31

dequbed added 2 commits January 19, 2022 17:40

cargo fmt the everythings

80b422f

Make the fuzzing targets not error on CircleCI

b88920f

If the CircleCI runner allows us to use AVX2 extensions this will use them readily, if it doesn't the fuzz_target! will compile but do absolutely nothing.

marshallpierce mentioned this pull request Feb 8, 2022

add plantuml character set #174

Closed

marshallpierce mentioned this pull request May 4, 2022

Path to v1.0? #130

Open

marshallpierce mentioned this pull request Jun 26, 2022

perf(decode): optimize Vec alloc/resize #179

Open

dyc3 reviewed Jul 10, 2023

View reviewed changes

marshallpierce mentioned this pull request Jan 13, 2024

SIMD support? #266

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement an AVX2 accelerated Engine #170

Implement an AVX2 accelerated Engine #170

dequbed commented Sep 15, 2021

marshallpierce commented Sep 16, 2021

marshallpierce commented Sep 16, 2021

dequbed commented Sep 16, 2021 •

edited

Loading

marshallpierce commented Sep 23, 2021

dequbed commented Sep 23, 2021

dequbed commented Jan 19, 2022 •

edited

Loading

marshallpierce commented Jan 24, 2022

dequbed commented Jan 25, 2022

dequbed commented Jan 25, 2022 •

edited

Loading

marshallpierce commented Feb 8, 2022

marshallpierce commented Feb 8, 2022

dequbed commented Feb 23, 2022 •

edited

Loading

marshallpierce commented Mar 1, 2022

dyc3 Jul 10, 2023

Implement an AVX2 accelerated Engine #170

Are you sure you want to change the base?

Implement an AVX2 accelerated Engine #170

Conversation

dequbed commented Sep 15, 2021

What is this about?

What's the catch?

So, what now?

Okay, one last question: … But why?

marshallpierce commented Sep 16, 2021

marshallpierce commented Sep 16, 2021

dequbed commented Sep 16, 2021 • edited Loading

marshallpierce commented Sep 23, 2021

dequbed commented Sep 23, 2021

dequbed commented Jan 19, 2022 • edited Loading

marshallpierce commented Jan 24, 2022

dequbed commented Jan 25, 2022

dequbed commented Jan 25, 2022 • edited Loading

How do we want to handle engine selection?

marshallpierce commented Feb 8, 2022

marshallpierce commented Feb 8, 2022

dequbed commented Feb 23, 2022 • edited Loading

*Terms and conditions apply :P

marshallpierce commented Mar 1, 2022

dyc3 Jul 10, 2023

Choose a reason for hiding this comment

dequbed commented Sep 16, 2021 •

edited

Loading

dequbed commented Jan 19, 2022 •

edited

Loading

dequbed commented Jan 25, 2022 •

edited

Loading

dequbed commented Feb 23, 2022 •

edited

Loading