Optimize oj_dump_cstr using SSE4.2 and SSSE3.
#973
base: develop
… the worst case synthetic benchmarks.
Is this still a WIP?
Yes. I at least need to clean up the warnings. Two questions for you though:
If the overhead of runtime detection is trivial, then that would be fine, but compile-time selection is best if possible.
Apologies for the delays... it's been a busy week. I should be able to wrap this up within the next few days.
No worries. The same has happened to me on more than one occasion.
Optimize oj dump cstr sse4 refactor
This should be ready for review. There is still a bit of duplication between the NEON and SSE4.2 code, which could probably be made a bit more generic. I'm not entirely sure it's worth it, but I'm happy to do so if you'd like.
Hello! Me again, this time optimizing `oj_dump_cstr` on x86-64 platforms using SSE4.2 and SSSE3.
This PR is not yet ready for a thorough review, but I wanted to get the discussion started on which direction to take the implementation.
There is still a bit of work to be done, particularly around reducing the duplication between the x86-64/AMD64 and ARM64 SIMD implementations. I haven't spent much time on that yet. Additionally, this only supports the `:compat` mode, but `:rails` should be quick to support.
I noticed that `oj/ext/extconf.rb` has an option, `--with-sse42`, to compile SSE4.2 support. It looks like this is currently only used in the parser. This PR, however, currently uses runtime instruction set detection to determine whether it can use the SSE4.2/SSSE3 functions. I use the `__builtin_cpu_supports` function, which is supported by GCC and clang, to determine whether the CPU supports the necessary instructions. I'm happy to continue down the path of runtime CPU detection or to switch to compile-time support.
The benefit of runtime CPU detection is that anyone receiving a binary distribution of this library gets the SSE4.2 support, so long as the platform compiling the library can compile SSE4.2/SSSE3 instructions (and is using either GCC or clang). Additionally, consumers of the library don't need to provide any configure options to get the support. However, I'm not set on this approach. If you'd rather I switch this to an explicit compiler flag and/or use the existing `--with-sse42` flag, I'm happy to do so.
Additionally, even with runtime instruction set detection, I can provide a compiler flag to disable the feature (or enable it, depending on whether we want opt-in or opt-out).
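For discussion, here is a minimal sketch of what runtime dispatch with an opt-out could look like. It is not the code in this PR: `scan_scalar`, `scan_sse42`, `select_scanner`, and the `EXAMPLE_DISABLE_SIMD` macro are hypothetical names, the escape set (control characters, `"`, and `\`) is a simplification, and the SIMD routine is a placeholder that just calls the scalar loop. Only `__builtin_cpu_init` and `__builtin_cpu_supports` are real GCC/clang builtins.

```c
#include <stddef.h>

/* Hypothetical signature: index of the first byte that needs a JSON
 * escape, or len if the whole buffer can be copied verbatim. */
typedef size_t (*escape_scan_fn)(const char *s, size_t len);

/* Portable fallback. Simplified escape set (assumption): control
 * characters, double quote, and backslash. */
static size_t scan_scalar(const char *s, size_t len) {
    size_t i;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c < 0x20 || c == '"' || c == '\\') {
            break;
        }
    }
    return i;
}

/* Placeholder for the SSE4.2/SSSE3 routine. A real version would use
 * intrinsics; here it only delegates so the sketch stays portable. */
static size_t scan_sse42(const char *s, size_t len) {
    return scan_scalar(s, len);
}

/* Runtime check, compiled out when the hypothetical opt-out macro
 * EXAMPLE_DISABLE_SIMD is defined or the target is not x86. */
static int cpu_has_sse42_and_ssse3(void) {
#if !defined(EXAMPLE_DISABLE_SIMD) && defined(__GNUC__) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("ssse3");
#else
    return 0;
#endif
}

/* Pick the implementation; a caller would do this once at load time
 * and keep the returned pointer. */
static escape_scan_fn select_scanner(void) {
    return cpu_has_sse42_and_ssse3() ? scan_sse42 : scan_scalar;
}
```

With this shape the per-call cost of runtime detection is one indirect call through the stored pointer, and building with `-DEXAMPLE_DISABLE_SIMD` forces the scalar path regardless of the CPU.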
Update 2025/06/25
As of commit `37dc86450aa833bf8c2ab768b573b0e28f0b1f95`, the SSE4.2/SSSE3 code is behind the `--with-sse42` flag. The benchmarks below have been updated too.
Benchmarks
CPU: Intel(R) Core(TM) i7-8850H
Real world benchmarks
`develop` commit: `2e95f15d9207c18a4ee1eccfb1a2259ccda9c3a8`
`optimize-oj_dump_cstr-sse4` commit: `37dc86450aa833bf8c2ab768b573b0e28f0b1f95`
Synthetic (happiest-of-happy paths - but still needs a single escape)