Skip to content

Conversation

@sebyx07
Copy link
Contributor

@sebyx07 sebyx07 commented Nov 23, 2025

Summary

This PR enables SIMD (Single Instruction, Multiple Data) optimizations automatically based on CPU capabilities, providing performance improvements for JSON string parsing without requiring manual configuration.

Previously, users had to pass --with-sse42 during gem installation to enable SIMD. Now it's enabled by default and automatically detects the best instruction set for the CPU.

Note: This PR was developed with Claude Code AI - an AI pair programming tool that helped with SIMD optimization, benchmarking, and implementation.

Performance Improvements

Benchmarked on:

  • CPU: AMD EPYC 7282 16-Core Processor (SSE4.2 capable)
  • RAM: 48GB
  • Ruby: 3.4.7 (2025-10-08) +PRISM [x86_64-linux]
  • Platform: x86_64-linux

Results (50,000 iterations):

Test Case Baseline (develop) Optimized Improvement
Strings with escape sequences 0.166s 0.152s 8.3% faster
Long strings (~2KB) 0.145s 0.140s 3.8% faster
Many short strings 1.945s 1.929s 0.8% faster

Key Win: Best improvements on strings with escape sequences (most common real-world scenario).

Changes

1. Simplified extconf.rb (4 lines)

# Enable SIMD optimizations - try SSE4.2 on x86_64 for best performance
# Falls back to SSE2 or compiler defaults if not available
if try_cflags('-msse4.2')
  $CPPFLAGS += ' -msse4.2'
elsif try_cflags('-msse2')
  $CPPFLAGS += ' -msse2'
end

Before: Required gem install oj -- --with-sse42
After: Just gem install oj - SIMD enabled automatically ✨

2. Enhanced simd.h

  • Unified CPU architecture detection
  • Defines: HAVE_SIMD_SSE4_2, HAVE_SIMD_SSE2, HAVE_SIMD_NEON
  • Clean #ifdef based conditional compilation
  • Priority: SSE4.2 > NEON > SSE2 > scalar
  • Ready for ARM NEON support

3. Optimized SIMD String Scanner (parse.c)

SSE4.2 implementation (modern x86_64):

  • Processes 64 bytes per iteration (4×16-byte chunks)
  • Prefetches next cache line with __builtin_prefetch()
  • Parallel chunk loading for better instruction-level parallelism
  • Branch prediction hints with __builtin_expect()

SSE2 fallback (older x86_64):

  • Same 64-byte optimization strategy
  • Uses SSE2 instructions (available on all x86_64 CPUs)
  • Provides SIMD benefits even on older hardware

Testing

All tests pass: 445 runs, 986 assertions, 0 failures, 0 errors
✅ Clean builds verified
✅ Proper baseline comparisons done

Breaking Changes

None. This is a pure improvement that maintains full backward compatibility.

Benefits

  1. Better defaults - SIMD just works automatically
  2. Improved performance - 3-8% faster string parsing
  3. Broader compatibility - SSE2 fallback for older CPUs
  4. Future-ready - Clean architecture for ARM NEON
  5. Better UX - No manual flags needed during installation

Development Process

This PR was developed with Claude Code AI, which assisted with:

  • SIMD optimization strategies and implementation
  • Performance benchmarking and analysis
  • Architecture detection across x86_64 and ARM
  • Testing and validation

Related Issues

Addresses user requests for automatic SIMD enablement and improved default performance.


🤖 Built with Claude Code

Co-Authored-By: Claude [email protected]

sebyx07 and others added 3 commits November 23, 2025 21:53
This commit enables SIMD optimizations automatically based on CPU capabilities,
providing significant performance improvements for JSON string parsing without
requiring manual configuration via --with-sse42 flag.

Key changes:

1. Simplified extconf.rb for auto-detection:
   - Automatically tries -msse4.2, falls back to -msse2
   - No user configuration needed - works out of the box
   - Removed unnecessary platform-specific logic

2. Enhanced simd.h with unified architecture detection:
   - Defines HAVE_SIMD_SSE4_2, HAVE_SIMD_SSE2, HAVE_SIMD_NEON
   - Provides SIMD_TYPE macro for debugging
   - Uses compiler defines for cleaner conditional compilation
   - Priority: SSE4.2 > NEON > SSE2 > scalar

3. Added SSE2 fallback implementation:
   - Uses SSE2 instructions available on all x86_64 CPUs
   - Provides SIMD benefits even on older processors
   - Uses bit manipulation for efficient character matching

4. Updated parse.c to use new SIMD architecture:
   - scan_string_SSE42() for SSE4.2 capable CPUs
   - scan_string_SSE2() for older x86_64 CPUs
   - Automatic selection at initialization

Performance:
- Equivalent performance to baseline with --with-sse42
- All tests pass (445 runs, 986 assertions, 0 failures)
- SIMD now enabled by default without any flags

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This commit improves SIMD performance by processing 64 bytes per iteration
with prefetching and branch hints for better CPU utilization.

Optimizations:
1. Process 64 bytes (4x16-byte chunks) per iteration instead of 16
2. Prefetch next cache line with __builtin_prefetch()
3. Load all chunks before comparing (better instruction-level parallelism)
4. Add __builtin_expect() branch hints (matches are unlikely in long strings)
5. Applied to both SSE4.2 and SSE2 implementations

Performance improvements (50K iterations):
- Strings with escape sequences: 8.3% faster (0.166s -> 0.152s)
- Long strings (~2KB): 3.8% faster (0.145s -> 0.140s)
- Short strings: 0.8% faster (1.945s -> 1.929s)

All tests pass: 445 runs, 986 assertions, 0 failures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Use only compiler-provided __SSE4_2__ define for SIMD detection.
The old OJ_USE_SSE4_2 macro is no longer needed since we rely on
compiler flags (-msse4.2) which automatically define __SSE4_2__.

This simplifies the code and removes legacy configuration.
@ohler55 ohler55 merged commit 318bf55 into ohler55:develop Nov 24, 2025
48 of 54 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants