Skip to content

Releases: davidesantangelo/vsort

v1.0.0

04 Nov 14:56

Choose a tag to compare

Added

  • Unified configuration API with vsort_options_t, feature flags, and result codes
  • New vsort_sort entry point plus helpers to query and set default behaviour
  • Stable merge sort mode and counting sort fast path for byte arrays
  • Parallel introsort pipeline with chunked merging on Apple Silicon devices

Changed

  • Replaced legacy quicksort driver with adaptive introsort + heapsort fallback
  • Refactored runtime into a calibrated hardware profile with thread-safe init
  • Simplified radix sort guardrails and nearly-sorted detection heuristics
  • Default parallel and radix policies now configurable per-call

Fixed

  • Prevented radix overflow on very large integer ranges
  • Guaranteed deterministic comparator handling for generic dispatch
  • Eliminated redundant memory allocations in merge passes

v0.5.0

17 Apr 14:29

Choose a tag to compare

Added

  • Introduced significant performance optimizations via compiler flags:
    • -Ofast, -march=native, -ffast-math, -fomit-frame-pointer added to release builds (CMake and build script).
    • Enabled Link-Time Optimization (LTO) (-flto) for release builds.
    • Enabled Interprocedural Optimization (IPO) via CMake where supported.
  • Added compiler-specific hints for potential performance gains:
    • VSORT_RESTRICT (__restrict__) for pointers.
    • VSORT_LIKELY/VSORT_UNLIKELY (__builtin_expect) for branch prediction.
    • inline and __attribute__((hot)) / VSORT_HOT for critical functions.
  • Introduced VSORT_HOT macro to manage __attribute__((hot)) portability.

Changed

  • Version bumped to 0.5.0.
  • Enabled Apple Silicon optimizations (USE_APPLE_SILICON_OPTIMIZATIONS) by default in CMake.
  • Updated and simplified CMake build steps in CI workflow (.github/workflows/apple_silicon_ci.yml).
  • Updated build instructions in README.md to reflect current CMake usage.
  • Applied optimization flags to the standalone build_script.sh.
  • Made internal usage of certain compiler attributes (always_inline, hot) conditional for improved portability.

Fixed

  • Improved build portability, especially for compilers like MSVC, by abstracting or conditionally compiling GCC/Clang-specific __attribute__ directives.

Removed

  • Outdated // TODO: comments from source code.
  • Development roadmap section from README.md.
  • Unnecessary forward declarations for swap_int/swap_float in vsort.c.

v0.4.0

03 Apr 14:15

Choose a tag to compare

Added

  • Parallel Merge Dispatch (GCD): Implemented parallel execution for merge passes within the parallel sorting algorithm (vsort_parallel_int, vsort_parallel_float) on Apple platforms using dispatch_apply. This replaces the previous sequential merge loop, significantly improving performance for large arrays by leveraging multiple cores during the merge phase.

Changed

  • vsort_float Optimization: Removed the fallback to standard qsort. vsort_float now uses the internal adaptive sorting framework (parallel dispatch, quicksort, insertion sort) consistent with vsort for integers.
  • Hardware Detection (macOS): Improved robustness and efficiency by replacing popen calls with direct sysctlbyname system calls for detecting CPU cores, cache sizes, and other hardware characteristics.
  • Benchmark Mergesort: Refactored the custom_mergesort implementation in examples/benchmark.c to use a single, pre-allocated temporary buffer, making it more efficient and a fairer comparison against vsort.
  • Documentation (README.md): Significantly updated to accurately reflect the v0.4.0 implementation status, including the new parallel merge dispatch, NEON status (planned), revised technical details, and updated roadmap. Removed placeholder performance tables.
  • Quicksort Stack: Increased default stack capacity in iterative quicksort and improved safety checks.

Fixed

  • Build Warnings/Errors: Resolved various compiler warnings identified in previous builds, including MIN/MAX macro redefinitions (renamed to VSORT_MIN/VSORT_MAX), unused variables (pushed, timing_errors), unused functions (P/E core distribution logic, default float comparator), and issues with posix_memalign preprocessor checks.
  • Benchmark Example (examples/benchmark.c): Replaced INFINITY with DBL_MAX to resolve warnings/issues related to the -ffast-math flag. Moved srand call to the beginning of main for proper seeding. Corrected algorithm name reference from std::sort to std_sort. Improved result reporting logic.

v0.3.4

23 Mar 19:43

Choose a tag to compare

Added

  • Improved P/E core utilization with intelligent workload distribution
  • More precise detection of performance and efficiency cores on Apple Silicon
  • Chunk complexity analysis to assign work to appropriate core types
  • Separate dispatch queues with different QoS levels for different core types

Improved

  • More accurate detection of P-core and E-core counts using sysctl
  • Better parallel sorting performance through optimized core utilization
  • Power efficiency by directing appropriate work to efficiency cores

v0.3.3

20 Mar 08:01

Choose a tag to compare

Added

  • Dynamic threshold adjustment to auto-tune sorting parameters based on hardware
  • Hardware detection system that identifies CPU model, cache sizes, and core types
  • Adaptive calibration for sorting thresholds to optimize performance by hardware
  • Enhanced CPU detection on both Apple Silicon and Linux/Windows platforms

Improved

  • Better utilization of CPU cache hierarchy for optimal sorting performance
  • Refined parallel threshold calculation based on available cores and cache sizes
  • More intelligent P-core vs E-core utilization on Apple Silicon
  • Improved documentation of hardware detection and calibration process

v0.3.2

19 Mar 15:58

Choose a tag to compare

Added

  • CI testing on Apple Silicon hardware to validate NEON optimizations
  • Logger system with configurable verbosity levels to replace printf debugging

Improved

  • Removed all printf debugging statements from production code
  • Better error handling with proper logging instead of console output
  • Simplified code paths for improved maintainability
  • Enhanced parallel sorting strategy with better sequential merge for reliability
  • Added more thorough error detection and validation throughout sorting process
  • Improved memory alignment handling for NEON SIMD operations
  • Added robust Makefile for simpler build process on all platforms
  • Properly marked unused functions with attributes to silence compiler warnings

Fixed

  • Memory alignment issues in SIMD operations
  • CI pipeline now tests on relevant hardware platforms
  • Code cleanup for production readiness
  • CMake build errors with -mfpu=neon flag on Apple Silicon platforms
  • Corrected boundary checking in parallel merge algorithm to prevent sorting failures
  • Addressed potential memory issues in performance tests with large arrays
  • Improved verification checks to detect and recover from incorrect merges
  • Fixed compiler warnings about unused functions in the codebase

v0.3.1

19 Mar 08:04

Choose a tag to compare

Added

  • Complete NEON vectorization implementation for partition operations.
  • Enhanced vectorized merge operation with optimized fast paths for homogeneous data segments.
  • Specialized vector processing for different data patterns.

Improved

  • Partition function now processes 4 elements at once using NEON SIMD.
  • Merge function utilizes vectorized comparisons and bulk transfers.
  • Better detection of sequential data patterns for faster processing.
  • Significant performance boost especially for large datasets and reverse-sorted data.
  • Updated benchmark results demonstrating up to 4.4x speedup for reverse-sorted data and up to 5.94x faster sorting than std::qsort.

v0.3.0

18 Mar 14:49

Choose a tag to compare

Added

  • Improved performance for Apple Silicon with P-core and E-core detection
  • Adaptive chunk sizing based on cache characteristics
  • New parallel merge function with vectorization support
  • Work-stealing queue structure for better load balancing

Changed

  • Completely redesigned parallel sorting implementation
  • Enhanced thread allocation strategy based on array size and core types
  • Improved workload distribution for better performance
  • Better balancing between cache efficiency and parallelism
  • More efficient parallel merging using a balanced binary tree approach

Fixed

  • Memory management in parallel sorting operation
  • Potential performance bottlenecks in larger arrays

v0.2.1

17 Mar 14:52

Choose a tag to compare

Fixed

  • Cross-platform compatibility with Windows systems
  • Memory allocation handling in parallel sorting algorithm
  • Build system improvements for cross-platform compilation
  • CMake configuration for Visual Studio compiler flags

Improved

  • Error handling in quicksort algorithm
  • Thread allocation for platforms without Grand Central Dispatch
  • Cache utilization in sorting algorithm
  • Documentation for Windows compatibility

v0.2.0

17 Mar 13:06

Choose a tag to compare

Added

  • Enhanced NEON vectorization that processes 4 integers simultaneously
  • Adaptive algorithm selection that detects data patterns for optimal sorting
  • Custom thread management based on array size and Apple Silicon core characteristics
  • Specialized fast-path optimizations for common data patterns
  • Benchmark suite for comparison against standard sorting algorithms
  • Documentation for performance characteristics and optimization techniques

Improved

  • Performance on large arrays (up to 30% faster for 1M+ elements)
  • Memory efficiency through optimized cache line alignment
  • Handling of nearly-sorted and reverse-sorted data, now up to 3.5x faster
  • Work distribution across Apple Silicon's heterogeneous cores
  • Branch prediction by optimizing decision paths in sorting algorithms
  • Compiler optimizations specific to Apple Silicon

Fixed

  • Memory access patterns to reduce cache misses on Apple Silicon
  • Thread allocation issues on heavily loaded systems
  • Performance degradation on certain data distribution patterns
  • Inconsistent behavior with large arrays (>2M elements)

Performance

  • Random data: Outperforms quicksort by up to 12%
  • Nearly-sorted data: Up to 40% performance improvement
  • Reverse-sorted data: Up to 3.5x faster processing
  • Large arrays: Memory overhead reduced by 15% compared to previous version

Documentation

  • Added detailed benchmarks and performance comparison tables
  • Expanded explanation of Apple Silicon optimizations
  • Improved usage examples and API documentation
  • Added troubleshooting section for common issues