Releases · davidesantangelo/vsort

04 Nov 14:56

davidesantangelo

v1.0.0

1cec071

v1.0.0 Latest

Latest

Added

Unified configuration API with vsort_options_t, feature flags, and result codes
New vsort_sort entry point plus helpers to query and set default behaviour
Stable merge sort mode and counting sort fast path for byte arrays
Parallel introsort pipeline with chunked merging on Apple Silicon devices

Changed

Replaced legacy quicksort driver with adaptive introsort + heapsort fallback
Refactored runtime into a calibrated hardware profile with thread-safe init
Simplified radix sort guardrails and nearly-sorted detection heuristics
Default parallel and radix policies now configurable per-call

Fixed

Prevented radix overflow on very large integer ranges
Guaranteed deterministic comparator handling for generic dispatch
Eliminated redundant memory allocations in merge passes

Assets 2

17 Apr 14:29

davidesantangelo

v0.5.0

2abd90f

v0.5.0

Added

Introduced significant performance optimizations via compiler flags:
- -Ofast, -march=native, -ffast-math, -fomit-frame-pointer added to release builds (CMake and build script).
- Enabled Link-Time Optimization (LTO) (-flto) for release builds.
- Enabled Interprocedural Optimization (IPO) via CMake where supported.
Added compiler-specific hints for potential performance gains:
- VSORT_RESTRICT (__restrict__) for pointers.
- VSORT_LIKELY/VSORT_UNLIKELY (__builtin_expect) for branch prediction.
- inline and __attribute__((hot)) / VSORT_HOT for critical functions.
Introduced VSORT_HOT macro to manage __attribute__((hot)) portability.

Changed

Version bumped to 0.5.0.
Enabled Apple Silicon optimizations (USE_APPLE_SILICON_OPTIMIZATIONS) by default in CMake.
Updated and simplified CMake build steps in CI workflow (.github/workflows/apple_silicon_ci.yml).
Updated build instructions in README.md to reflect current CMake usage.
Applied optimization flags to the standalone build_script.sh.
Made internal usage of certain compiler attributes (always_inline, hot) conditional for improved portability.

Fixed

Improved build portability, especially for compilers like MSVC, by abstracting or conditionally compiling GCC/Clang-specific __attribute__ directives.

Removed

Outdated // TODO: comments from source code.
Development roadmap section from README.md.
Unnecessary forward declarations for swap_int/swap_float in vsort.c.

Assets 2

03 Apr 14:15

davidesantangelo

v0.4.0

02b7c42

v0.4.0

Added

Parallel Merge Dispatch (GCD): Implemented parallel execution for merge passes within the parallel sorting algorithm (vsort_parallel_int, vsort_parallel_float) on Apple platforms using dispatch_apply. This replaces the previous sequential merge loop, significantly improving performance for large arrays by leveraging multiple cores during the merge phase.

Changed

vsort_float Optimization: Removed the fallback to standard qsort. vsort_float now uses the internal adaptive sorting framework (parallel dispatch, quicksort, insertion sort) consistent with vsort for integers.
Hardware Detection (macOS): Improved robustness and efficiency by replacing popen calls with direct sysctlbyname system calls for detecting CPU cores, cache sizes, and other hardware characteristics.
Benchmark Mergesort: Refactored the custom_mergesort implementation in examples/benchmark.c to use a single, pre-allocated temporary buffer, making it more efficient and a fairer comparison against vsort.
Documentation (README.md): Significantly updated to accurately reflect the v0.4.0 implementation status, including the new parallel merge dispatch, NEON status (planned), revised technical details, and updated roadmap. Removed placeholder performance tables.
Quicksort Stack: Increased default stack capacity in iterative quicksort and improved safety checks.

Fixed

Build Warnings/Errors: Resolved various compiler warnings identified in previous builds, including MIN/MAX macro redefinitions (renamed to VSORT_MIN/VSORT_MAX), unused variables (pushed, timing_errors), unused functions (P/E core distribution logic, default float comparator), and issues with posix_memalign preprocessor checks.
Benchmark Example (examples/benchmark.c): Replaced INFINITY with DBL_MAX to resolve warnings/issues related to the -ffast-math flag. Moved srand call to the beginning of main for proper seeding. Corrected algorithm name reference from std::sort to std_sort. Improved result reporting logic.

Assets 2

23 Mar 19:43

davidesantangelo

v0.3.4

42c43fa

v0.3.4

Added

Improved P/E core utilization with intelligent workload distribution
More precise detection of performance and efficiency cores on Apple Silicon
Chunk complexity analysis to assign work to appropriate core types
Separate dispatch queues with different QoS levels for different core types

Improved

More accurate detection of P-core and E-core counts using sysctl
Better parallel sorting performance through optimized core utilization
Power efficiency by directing appropriate work to efficiency cores

Assets 2

20 Mar 08:01

davidesantangelo

v0.3.3

2ad7a01

v0.3.3

Added

Dynamic threshold adjustment to auto-tune sorting parameters based on hardware
Hardware detection system that identifies CPU model, cache sizes, and core types
Adaptive calibration for sorting thresholds to optimize performance by hardware
Enhanced CPU detection on both Apple Silicon and Linux/Windows platforms

Improved

Better utilization of CPU cache hierarchy for optimal sorting performance
Refined parallel threshold calculation based on available cores and cache sizes
More intelligent P-core vs E-core utilization on Apple Silicon
Improved documentation of hardware detection and calibration process

Assets 2

19 Mar 15:58

davidesantangelo

v0.3.2

65571ed

v0.3.2

Added

CI testing on Apple Silicon hardware to validate NEON optimizations
Logger system with configurable verbosity levels to replace printf debugging

Improved

Removed all printf debugging statements from production code
Better error handling with proper logging instead of console output
Simplified code paths for improved maintainability
Enhanced parallel sorting strategy with better sequential merge for reliability
Added more thorough error detection and validation throughout sorting process
Improved memory alignment handling for NEON SIMD operations
Added robust Makefile for simpler build process on all platforms
Properly marked unused functions with attributes to silence compiler warnings

Fixed

Memory alignment issues in SIMD operations
CI pipeline now tests on relevant hardware platforms
Code cleanup for production readiness
CMake build errors with -mfpu=neon flag on Apple Silicon platforms
Corrected boundary checking in parallel merge algorithm to prevent sorting failures
Addressed potential memory issues in performance tests with large arrays
Improved verification checks to detect and recover from incorrect merges
Fixed compiler warnings about unused functions in the codebase

Assets 2

19 Mar 08:04

davidesantangelo

v0.3.1

e02fe38

v0.3.1

Added

Complete NEON vectorization implementation for partition operations.
Enhanced vectorized merge operation with optimized fast paths for homogeneous data segments.
Specialized vector processing for different data patterns.

Improved

Partition function now processes 4 elements at once using NEON SIMD.
Merge function utilizes vectorized comparisons and bulk transfers.
Better detection of sequential data patterns for faster processing.
Significant performance boost especially for large datasets and reverse-sorted data.
Updated benchmark results demonstrating up to 4.4x speedup for reverse-sorted data and up to 5.94x faster sorting than std::qsort.

Assets 2

18 Mar 14:49

davidesantangelo

v0.3.0

7d5526d

v0.3.0

Added

Improved performance for Apple Silicon with P-core and E-core detection
Adaptive chunk sizing based on cache characteristics
New parallel merge function with vectorization support
Work-stealing queue structure for better load balancing

Changed

Completely redesigned parallel sorting implementation
Enhanced thread allocation strategy based on array size and core types
Improved workload distribution for better performance
Better balancing between cache efficiency and parallelism
More efficient parallel merging using a balanced binary tree approach

Fixed

Memory management in parallel sorting operation
Potential performance bottlenecks in larger arrays

Assets 2

17 Mar 14:52

davidesantangelo

v0.2.1

048dd91

v0.2.1

Fixed

Cross-platform compatibility with Windows systems
Memory allocation handling in parallel sorting algorithm
Build system improvements for cross-platform compilation
CMake configuration for Visual Studio compiler flags

Improved

Error handling in quicksort algorithm
Thread allocation for platforms without Grand Central Dispatch
Cache utilization in sorting algorithm
Documentation for Windows compatibility

Assets 2

17 Mar 13:06

davidesantangelo

v0.2.0

ceafafa

v0.2.0

Added

Enhanced NEON vectorization that processes 4 integers simultaneously
Adaptive algorithm selection that detects data patterns for optimal sorting
Custom thread management based on array size and Apple Silicon core characteristics
Specialized fast-path optimizations for common data patterns
Benchmark suite for comparison against standard sorting algorithms
Documentation for performance characteristics and optimization techniques

Improved

Performance on large arrays (up to 30% faster for 1M+ elements)
Memory efficiency through optimized cache line alignment
Handling of nearly-sorted and reverse-sorted data, now up to 3.5x faster
Work distribution across Apple Silicon's heterogeneous cores
Branch prediction by optimizing decision paths in sorting algorithms
Compiler optimizations specific to Apple Silicon

Fixed

Memory access patterns to reduce cache misses on Apple Silicon
Thread allocation issues on heavily loaded systems
Performance degradation on certain data distribution patterns
Inconsistent behavior with large arrays (>2M elements)

Performance

Random data: Outperforms quicksort by up to 12%
Nearly-sorted data: Up to 40% performance improvement
Reverse-sorted data: Up to 3.5x faster processing
Large arrays: Memory overhead reduced by 15% compared to previous version

Documentation

Added detailed benchmarks and performance comparison tables
Expanded explanation of Apple Silicon optimizations
Improved usage examples and API documentation
Added troubleshooting section for common issues

Assets 2

Releases: davidesantangelo/vsort

v1.0.0

Added

Changed

Fixed

Uh oh!

v0.5.0

Added

Changed

Fixed

Removed

Uh oh!

v0.4.0

Added

Changed

Fixed

Uh oh!

v0.3.4

Added

Improved

Uh oh!

v0.3.3

Added

Improved

Uh oh!

v0.3.2

Added

Improved

Fixed

Uh oh!

v0.3.1

Added

Improved

Uh oh!

v0.3.0

Added

Changed

Fixed

Uh oh!

v0.2.1

Fixed

Improved

Uh oh!

v0.2.0

Added

Improved

Fixed

Performance

Documentation

Uh oh!