
Conversation

@samyron (Contributor) commented Jun 11, 2025

Hello! Me again.. this time optimizing oj_dump_cstr on x86-64 platforms using SSE4.2 and SSSE3.

This PR is not yet ready for a thorough review, but I wanted to get the discussion started on which direction to take the implementation.

There is still a bit of work to be done, particularly around reducing the duplication between the x86-64/AMD64 and ARM64 SIMD implementations; I haven't spent much time on that yet. Additionally, this only supports :compat mode, but :rails should be quick to support.
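
For context on what the SIMD fast path is doing, here is a rough sketch of the kind of scan involved: find the first byte that needs a JSON escape (a control character, '"', or '\') 16 bytes at a time, so that unescaped runs can be copied to the output in bulk. This is illustrative only and not the PR's actual code; it uses only baseline SSE2 intrinsics, while the real fast path also relies on SSE4.2/SSSE3 instructions.

#include <emmintrin.h> /* SSE2 intrinsics (baseline on x86-64) */
#include <stddef.h>

/* Return the index of the first byte that needs a JSON escape, or len if there is none. */
static size_t first_escape_index(const unsigned char *str, size_t len) {
    const __m128i quote  = _mm_set1_epi8('"');
    const __m128i bslash = _mm_set1_epi8('\\');
    const __m128i ctrl   = _mm_set1_epi8(0x1f);
    size_t        i      = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(str + i));
        __m128i needs = _mm_or_si128(_mm_cmpeq_epi8(chunk, quote),
                                     _mm_cmpeq_epi8(chunk, bslash));
        /* unsigned "byte <= 0x1f" test: min(byte, 0x1f) == byte */
        needs = _mm_or_si128(needs, _mm_cmpeq_epi8(_mm_min_epu8(chunk, ctrl), chunk));
        int mask = _mm_movemask_epi8(needs);
        if (mask != 0) {
            return i + (size_t)__builtin_ctz(mask); /* first flagged byte in this chunk */
        }
    }
    for (; i < len; i++) { /* scalar tail for the last < 16 bytes */
        unsigned char c = str[i];
        if (c < 0x20 || c == '"' || c == '\\') {
            return i;
        }
    }
    return len;
}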

I noticed that oj/ext/extconf.rb has a --with-sse42 option to compile in SSE4.2 support. It looks like this is currently only used in the parser. This PR, however, currently uses runtime instruction set detection to determine whether it can use the SSE4.2/SSSE3 functions: I use the __builtin_cpu_supports builtin, which is supported by both GCC and clang, to check that the CPU supports the necessary instructions. I'm happy to continue down the runtime-detection path or switch to compile-time support. The benefit of runtime CPU detection is that anyone receiving a binary distribution of this library gets the SSE4.2 support, as long as the platform compiling the library can emit SSE4.2/SSSE3 instructions (and is currently using either GCC or clang). Additionally, consumers of the library don't need to provide any configure options to get the support. However, I'm not set on this approach. If you'd rather I switch this to an explicit compiler flag and/or use the existing --with-sse42 flag, I'm happy to do so.
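
To make the runtime-detection idea concrete, here is a minimal sketch of the dispatch pattern, assuming GCC or clang; the function names are hypothetical and not the symbols used in this PR.

#include <stddef.h>

void dump_cstr_sse42(const char *str, size_t len);  /* built with -msse4.2 -mssse3, defined elsewhere */
void dump_cstr_scalar(const char *str, size_t len); /* portable fallback, defined elsewhere */

void dump_cstr(const char *str, size_t len) {
    static int use_simd = -1; /* resolved once; later calls only read the cached flag */

    if (use_simd < 0) {
        __builtin_cpu_init();
        use_simd = __builtin_cpu_supports("sse4.2") && __builtin_cpu_supports("ssse3");
    }
    if (use_simd) {
        dump_cstr_sse42(str, len);
    } else {
        dump_cstr_scalar(str, len);
    }
}

The detection cost is a one-time CPUID query; after that the check is a single cached integer read, which is why the per-call overhead is trivial.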

Additionally, even with runtime instruction set detection, I can provide a compiler flag to disable the feature (or enable it, depending on whether we want opt-in or opt-out).

Update 2025/06/25

As of commit 37dc86450aa833bf8c2ab768b573b0e28f0b1f95, the SSE4.2/SSSE3 code is behind the --with-sse42 flag. The benchmarks below have been updated accordingly.
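
For comparison with the runtime approach described earlier, the compile-time gate can be sketched as below. This assumes --with-sse42 ends up defining a preprocessor macro; OJ_USE_SSE4_2 is an illustrative name, not necessarily the macro oj actually defines, and the function names are again hypothetical.

#include <stddef.h>

void dump_cstr_sse42(const char *str, size_t len);  /* compiled only when the flag is given */
void dump_cstr_scalar(const char *str, size_t len); /* portable fallback */

void dump_cstr(const char *str, size_t len) {
#ifdef OJ_USE_SSE4_2
    dump_cstr_sse42(str, len);
#else
    dump_cstr_scalar(str, len);
#endif
}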

Benchmarks

CPU: Intel(R) Core(TM) i7-8850H

Real world benchmarks

develop commit: 2e95f15d9207c18a4ee1eccfb1a2259ccda9c3a8
optimize-oj_dump_cstr-sse4 commit: 37dc86450aa833bf8c2ab768b573b0e28f0b1f95

== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after     1.257k i/100ms
Calculating -------------------------------------
               after     12.603k (± 5.0%) i/s   (79.35 μs/i) -     62.850k in   5.001904s

Comparison:
              before:     9014.7 i/s
               after:    12602.6 i/s - 1.40x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after    72.000 i/100ms
Calculating -------------------------------------
               after    652.720 (± 8.9%) i/s    (1.53 ms/i) -      3.312k in   5.119064s

Comparison:
              before:      703.9 i/s
               after:      652.7 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after   137.000 i/100ms
Calculating -------------------------------------
               after      1.387k (± 6.1%) i/s  (721.05 μs/i) -      6.987k in   5.058261s

Comparison:
              before:     1224.0 i/s
               after:     1386.9 i/s - same-ish: difference falls within error


== Encoding ohai.json (20145 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after     1.864k i/100ms
Calculating -------------------------------------
               after     20.398k (± 6.4%) i/s   (49.02 μs/i) -    102.520k in   5.050563s

Comparison:
              before:    20667.6 i/s
               after:    20398.2 i/s - same-ish: difference falls within error

Synthetic (happiest-of-happy paths - but still needs a single escape)

# benchmark_encoding "bytes.128.single-escape-at-end", ([(('a' * 127 + '\n'))] * 10000)

== Encoding bytes.128.single-escape-at-end (1330001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
               after    66.000 i/100ms
Calculating -------------------------------------
               after    756.543 (± 3.8%) i/s    (1.32 ms/i) -      3.828k in   5.067620s

Comparison:
              before:      232.3 i/s
               after:      756.5 i/s - 3.26x  faster

@ohler55 (Owner) commented Jun 17, 2025

Is this still a WIP?

@samyron (Contributor, Author) commented Jun 18, 2025

Is this still a WIP?

Yes... I at least need to clean up the warnings. Two questions for you, though:

  1. Are you okay with runtime CPU detection to determine which method to use (on x86-64)? Or would you prefer a compiler flag to enable/disable the feature?
  2. If using runtime CPU detection, would you like me to use that for the SSE4.2 code in the parser? If you would prefer the compile-time flag, would you like me to use the same --with-sse42 flag for this code too?

@ohler55 (Owner) commented Jun 18, 2025

If the overhead of runtime detection is trivial then that would be fine, but compile time is best if possible.

@samyron (Contributor, Author) commented Jun 22, 2025

Apologies for the delays... it's been a busy week. I should be able to wrap this up within the next few days.

@ohler55 (Owner) commented Jun 22, 2025

No worries. The same has happened to me on more than one occasion.

@samyron marked this pull request as ready for review on June 26, 2025 02:59
@samyron (Contributor, Author) commented Jun 26, 2025

This should be ready for review. There is still a bit of duplication between the NEON and SSE4.2 code, which could probably be made a bit more generic. I'm not entirely sure it's worth it, but I'm happy to do so if you'd like.
