Releases: davidesantangelo/vsort
Releases · davidesantangelo/vsort
v1.0.0
Added
- Unified configuration API with
vsort_options_t, feature flags, and result codes - New
vsort_sortentry point plus helpers to query and set default behaviour - Stable merge sort mode and counting sort fast path for byte arrays
- Parallel introsort pipeline with chunked merging on Apple Silicon devices
Changed
- Replaced legacy quicksort driver with adaptive introsort + heapsort fallback
- Refactored runtime into a calibrated hardware profile with thread-safe init
- Simplified radix sort guardrails and nearly-sorted detection heuristics
- Default parallel and radix policies now configurable per-call
Fixed
- Prevented radix overflow on very large integer ranges
- Guaranteed deterministic comparator handling for generic dispatch
- Eliminated redundant memory allocations in merge passes
v0.5.0
Added
- Introduced significant performance optimizations via compiler flags:
-Ofast,-march=native,-ffast-math,-fomit-frame-pointeradded to release builds (CMake and build script).- Enabled Link-Time Optimization (LTO) (
-flto) for release builds. - Enabled Interprocedural Optimization (IPO) via CMake where supported.
- Added compiler-specific hints for potential performance gains:
VSORT_RESTRICT(__restrict__) for pointers.VSORT_LIKELY/VSORT_UNLIKELY(__builtin_expect) for branch prediction.inlineand__attribute__((hot))/VSORT_HOTfor critical functions.
- Introduced
VSORT_HOTmacro to manage__attribute__((hot))portability.
Changed
- Version bumped to 0.5.0.
- Enabled Apple Silicon optimizations (
USE_APPLE_SILICON_OPTIMIZATIONS) by default in CMake. - Updated and simplified CMake build steps in CI workflow (
.github/workflows/apple_silicon_ci.yml). - Updated build instructions in
README.mdto reflect current CMake usage. - Applied optimization flags to the standalone
build_script.sh. - Made internal usage of certain compiler attributes (
always_inline,hot) conditional for improved portability.
Fixed
- Improved build portability, especially for compilers like MSVC, by abstracting or conditionally compiling GCC/Clang-specific
__attribute__directives.
Removed
- Outdated
// TODO:comments from source code. - Development roadmap section from
README.md. - Unnecessary forward declarations for
swap_int/swap_floatinvsort.c.
v0.4.0
Added
- Parallel Merge Dispatch (GCD): Implemented parallel execution for merge passes within the parallel sorting algorithm (
vsort_parallel_int,vsort_parallel_float) on Apple platforms usingdispatch_apply. This replaces the previous sequential merge loop, significantly improving performance for large arrays by leveraging multiple cores during the merge phase.
Changed
vsort_floatOptimization: Removed the fallback to standardqsort.vsort_floatnow uses the internal adaptive sorting framework (parallel dispatch, quicksort, insertion sort) consistent withvsortfor integers.- Hardware Detection (macOS): Improved robustness and efficiency by replacing
popencalls with directsysctlbynamesystem calls for detecting CPU cores, cache sizes, and other hardware characteristics. - Benchmark Mergesort: Refactored the
custom_mergesortimplementation inexamples/benchmark.cto use a single, pre-allocated temporary buffer, making it more efficient and a fairer comparison againstvsort. - Documentation (
README.md): Significantly updated to accurately reflect the v0.4.0 implementation status, including the new parallel merge dispatch, NEON status (planned), revised technical details, and updated roadmap. Removed placeholder performance tables. - Quicksort Stack: Increased default stack capacity in iterative quicksort and improved safety checks.
Fixed
- Build Warnings/Errors: Resolved various compiler warnings identified in previous builds, including
MIN/MAXmacro redefinitions (renamed toVSORT_MIN/VSORT_MAX), unused variables (pushed,timing_errors), unused functions (P/E core distribution logic, default float comparator), and issues withposix_memalignpreprocessor checks. - Benchmark Example (
examples/benchmark.c): ReplacedINFINITYwithDBL_MAXto resolve warnings/issues related to the-ffast-mathflag. Movedsrandcall to the beginning ofmainfor proper seeding. Corrected algorithm name reference fromstd::sorttostd_sort. Improved result reporting logic.
v0.3.4
Added
- Improved P/E core utilization with intelligent workload distribution
- More precise detection of performance and efficiency cores on Apple Silicon
- Chunk complexity analysis to assign work to appropriate core types
- Separate dispatch queues with different QoS levels for different core types
Improved
- More accurate detection of P-core and E-core counts using sysctl
- Better parallel sorting performance through optimized core utilization
- Power efficiency by directing appropriate work to efficiency cores
v0.3.3
Added
- Dynamic threshold adjustment to auto-tune sorting parameters based on hardware
- Hardware detection system that identifies CPU model, cache sizes, and core types
- Adaptive calibration for sorting thresholds to optimize performance by hardware
- Enhanced CPU detection on both Apple Silicon and Linux/Windows platforms
Improved
- Better utilization of CPU cache hierarchy for optimal sorting performance
- Refined parallel threshold calculation based on available cores and cache sizes
- More intelligent P-core vs E-core utilization on Apple Silicon
- Improved documentation of hardware detection and calibration process
v0.3.2
Added
- CI testing on Apple Silicon hardware to validate NEON optimizations
- Logger system with configurable verbosity levels to replace printf debugging
Improved
- Removed all printf debugging statements from production code
- Better error handling with proper logging instead of console output
- Simplified code paths for improved maintainability
- Enhanced parallel sorting strategy with better sequential merge for reliability
- Added more thorough error detection and validation throughout sorting process
- Improved memory alignment handling for NEON SIMD operations
- Added robust Makefile for simpler build process on all platforms
- Properly marked unused functions with attributes to silence compiler warnings
Fixed
- Memory alignment issues in SIMD operations
- CI pipeline now tests on relevant hardware platforms
- Code cleanup for production readiness
- CMake build errors with
-mfpu=neonflag on Apple Silicon platforms - Corrected boundary checking in parallel merge algorithm to prevent sorting failures
- Addressed potential memory issues in performance tests with large arrays
- Improved verification checks to detect and recover from incorrect merges
- Fixed compiler warnings about unused functions in the codebase
v0.3.1
Added
- Complete NEON vectorization implementation for partition operations.
- Enhanced vectorized merge operation with optimized fast paths for homogeneous data segments.
- Specialized vector processing for different data patterns.
Improved
- Partition function now processes 4 elements at once using NEON SIMD.
- Merge function utilizes vectorized comparisons and bulk transfers.
- Better detection of sequential data patterns for faster processing.
- Significant performance boost especially for large datasets and reverse-sorted data.
- Updated benchmark results demonstrating up to 4.4x speedup for reverse-sorted data and up to 5.94x faster sorting than
std::qsort.
v0.3.0
Added
- Improved performance for Apple Silicon with P-core and E-core detection
- Adaptive chunk sizing based on cache characteristics
- New parallel merge function with vectorization support
- Work-stealing queue structure for better load balancing
Changed
- Completely redesigned parallel sorting implementation
- Enhanced thread allocation strategy based on array size and core types
- Improved workload distribution for better performance
- Better balancing between cache efficiency and parallelism
- More efficient parallel merging using a balanced binary tree approach
Fixed
- Memory management in parallel sorting operation
- Potential performance bottlenecks in larger arrays
v0.2.1
Fixed
- Cross-platform compatibility with Windows systems
- Memory allocation handling in parallel sorting algorithm
- Build system improvements for cross-platform compilation
- CMake configuration for Visual Studio compiler flags
Improved
- Error handling in quicksort algorithm
- Thread allocation for platforms without Grand Central Dispatch
- Cache utilization in sorting algorithm
- Documentation for Windows compatibility
v0.2.0
Added
- Enhanced NEON vectorization that processes 4 integers simultaneously
- Adaptive algorithm selection that detects data patterns for optimal sorting
- Custom thread management based on array size and Apple Silicon core characteristics
- Specialized fast-path optimizations for common data patterns
- Benchmark suite for comparison against standard sorting algorithms
- Documentation for performance characteristics and optimization techniques
Improved
- Performance on large arrays (up to 30% faster for 1M+ elements)
- Memory efficiency through optimized cache line alignment
- Handling of nearly-sorted and reverse-sorted data, now up to 3.5x faster
- Work distribution across Apple Silicon's heterogeneous cores
- Branch prediction by optimizing decision paths in sorting algorithms
- Compiler optimizations specific to Apple Silicon
Fixed
- Memory access patterns to reduce cache misses on Apple Silicon
- Thread allocation issues on heavily loaded systems
- Performance degradation on certain data distribution patterns
- Inconsistent behavior with large arrays (>2M elements)
Performance
- Random data: Outperforms quicksort by up to 12%
- Nearly-sorted data: Up to 40% performance improvement
- Reverse-sorted data: Up to 3.5x faster processing
- Large arrays: Memory overhead reduced by 15% compared to previous version
Documentation
- Added detailed benchmarks and performance comparison tables
- Expanded explanation of Apple Silicon optimizations
- Improved usage examples and API documentation
- Added troubleshooting section for common issues