jotabulacios
This PR optimizes the SVO implementation of the protocol; after these changes it outperforms the classic implementation at every benchmarked size.

The following optimizations were implemented:

Parallelized the main accumulator loop to distribute the heavy workload across all available CPU cores.
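As a rough sketch of the idea (the actual PR likely uses a parallelism library rather than raw threads, and operates on field elements, not `u64`), the accumulator input can be split into chunks, each chunk reduced on its own core, and the partial results combined at the end:

```rust
use std::thread;

// Sketch only: split the workload into one chunk per available core,
// accumulate each chunk on its own thread, then combine the partial sums.
fn accumulate_parallel(values: &[u64]) -> u64 {
    if values.is_empty() {
        return 0;
    }
    let n_threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_size = values.len().div_ceil(n_threads);

    // Scoped threads let us borrow `values` without cloning it per thread.
    thread::scope(|s| {
        let handles: Vec<_> = values
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().copied().sum::<u64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}
```

The combine step works here because the accumulation is associative, so chunk-local partial sums can be merged in any order.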

Eliminated heap allocations in hot loops by refactoring functions (compute_p_beta, etc.) to use mutable, stack-allocated buffers instead of returning new Vecs on each call. This significantly reduces pressure on the memory allocator and removes contention between threads.
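The refactor pattern, in a minimal sketch (function names and the `u64` arithmetic are placeholders, not the PR's actual code): instead of returning a fresh `Vec` per iteration, the function writes into a caller-owned buffer that is allocated once outside the hot loop.

```rust
// Before (sketch): allocates a new Vec on every call inside the hot loop.
fn compute_weights_alloc(beta: u64, n: usize) -> Vec<u64> {
    (0..n as u64).map(|i| beta.wrapping_mul(i)).collect()
}

// After (sketch): the caller provides a reusable buffer, so the hot loop
// performs no heap allocation and threads don't contend on the allocator.
fn compute_weights_into(beta: u64, out: &mut [u64]) {
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = beta.wrapping_mul(i as u64);
    }
}
```

The caller hoists the buffer out of the loop (`let mut buf = vec![0u64; n];` once, then `compute_weights_into(beta, &mut buf)` per iteration), which is what removes the per-call allocator traffic.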

Introduced a polynomial transposition pre-computation step (transpose_poly_for_svo) to ensure sequential memory access. This dramatically improves CPU cache utilization by converting a slow, strided memory access pattern into cache-friendly contiguous reads.
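A minimal illustration of the access-pattern change (a plain matrix transpose over `u64`, standing in for the polynomial-specific `transpose_poly_for_svo`): paying one pass of strided writes up front turns every later column walk into a contiguous row walk.

```rust
/// Transpose a row-major `rows x cols` matrix so that what was a strided
/// column traversal (stride = `cols`) becomes a contiguous row traversal.
fn transpose(src: &[u64], rows: usize, cols: usize) -> Vec<u64> {
    assert_eq!(src.len(), rows * cols);
    let mut dst = vec![0u64; src.len()];
    for r in 0..rows {
        for c in 0..cols {
            // Element (r, c) of the source becomes element (c, r) of the result.
            dst[c * rows + r] = src[r * cols + c];
        }
    }
    dst
}
```

The pre-computation itself still touches memory with a stride, but it runs once; every subsequent read of a former "column" then streams through adjacent cache lines.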

Results

Initial run

| Size | Classic (avg median) | SVO (avg median) | SVO / Classic ratio |
|------|----------------------|------------------|---------------------|
| 16   | 2.87 ms              | 3.21 ms          | 1.12× (SVO slower)  |
| 18   | 9.87 ms              | 11.26 ms         | 1.14× (SVO slower)  |
| 20   | 33.51 ms             | 42.55 ms         | 1.27× (SVO slower)  |

After optimizations

| Size | Classic (median) | SVO (median) | SVO / Classic ratio |
|------|------------------|--------------|---------------------|
| 16   | 2.82 ms          | 2.68 ms      | 0.95× (SVO faster)  |
| 18   | 9.70 ms          | 9.42 ms      | 0.97× (SVO faster)  |
| 20   | 35.85 ms         | 33.27 ms     | 0.93× (SVO faster)  |

@jotabulacios jotabulacios marked this pull request as ready for review October 9, 2025 20:05