@Ka-zam commented Jan 20, 2026

The prior fix only handled the NEON kernels; the same error also existed in the x64 kernels. This PR also increases accuracy, adds edge-case tests, and refactors the polynomial evaluation to use intrinsics.

RUN_VOLK_TESTS: volk_32f_log2_32f(vlen=131071, iter=1000, tol=5e-06)
arch                       |         time |     throughput |    max_abs |
---------------------------+--------------+----------------+------------+
generic                    |  583.3820 ms |    1797.5 MB/s |          - |
u_sse4_1                   |   39.6278 ms |   26461.4 MB/s |    4.3e-06 |
a_sse4_1                   |   39.7973 ms |   26348.7 MB/s |    4.3e-06 |
u_avx2                     |   20.2697 ms |   51732.9 MB/s |    4.3e-06 |
a_avx2                     |   20.3882 ms |   51432.1 MB/s |    4.3e-06 |
u_avx2_fma                 |   20.3909 ms |   51425.4 MB/s |    4.3e-06 |
a_avx2_fma                 |   20.3383 ms |   51558.4 MB/s |    4.3e-06 |
u_avx512                   |   11.8731 ms |   88317.9 MB/s |    4.3e-06 |
a_avx512                   |   11.8860 ms |   88222.2 MB/s |    4.3e-06 |
u_avx512dq                 |   11.8220 ms |   88699.5 MB/s |    4.3e-06 | *
a_avx512dq                 |   11.8691 ms |   88347.6 MB/s |    4.3e-06 |
Best aligned arch          | u_avx512dq (49.35x)
Best unaligned arch        | u_avx512dq (49.35x)
--------------------------------------------------------------------------------

RUN_VOLK_TESTS: volk_32f_log2_32f(vlen=131071, iter=1000, tol=5e-06)
arch                       |         time |     throughput |    max_abs |
---------------------------+--------------+----------------+------------+
generic                    |  992.7362 ms |    1056.3 MB/s |          - |
u_sse4_1                   |  193.5862 ms |    5416.7 MB/s |    4.1e-06 |
a_sse4_1                   |  192.6163 ms |    5444.0 MB/s |    4.1e-06 |
u_avx2                     |  174.0292 ms |    6025.5 MB/s |    4.1e-06 | *
a_avx2                     |  173.9084 ms |    6029.7 MB/s |    4.1e-06 | *
u_avx2_fma                 |  175.2591 ms |    5983.2 MB/s |    4.1e-06 |
a_avx2_fma                 |  174.2142 ms |    6019.1 MB/s |    4.1e-06 |
Best aligned arch          | a_avx2 (5.71x)
Best unaligned arch        | u_avx2 (5.70x)
--------------------------------------------------------------------------------

RUN_VOLK_TESTS: volk_32f_log2_32f(vlen=131071, iter=1000, tol=5e-06)
arch                       |         time |     throughput |    max_abs |
---------------------------+--------------+----------------+------------+
generic                    | 1060.9555 ms |     988.4 MB/s |          - |
neon                       |  230.3564 ms |    4552.1 MB/s |    4.8e-06 |
neonv8                     |  194.7116 ms |    5385.4 MB/s |    3.9e-06 | *
Best aligned arch          | neonv8 (5.45x)
Best unaligned arch        | neonv8 (5.45x)
--------------------------------------------------------------------------------
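
The derived columns can be reproduced from the timings. The speedup in the "Best ... arch" rows is the generic time divided by the kernel time (e.g. 583.3820 / 11.8220 ≈ 49.35). The throughput column closely matches (within about 0.01%) counting both the float input and the float output, i.e. 8 bytes per element, with 1 MB = 10^6 bytes. That byte count is an inference from the numbers, not something the benchmark output states. The helper names below are hypothetical.

```c
/* Speedup as shown in the "Best ... arch" rows: generic time / kernel time. */
static double speedup(double generic_ms, double kernel_ms)
{
    return generic_ms / kernel_ms;
}

/* Throughput in MB/s (1 MB = 1e6 bytes). ASSUMPTION: input and output
   buffers are both counted, i.e. 2 * sizeof(float) = 8 bytes per element. */
static double throughput_mb_s(long vlen, long iters, double time_ms)
{
    double bytes = (double)vlen * (double)iters * 2.0 * (double)sizeof(float);
    return bytes / (time_ms * 1e-3) / 1e6;
}
```

For example, `throughput_mb_s(131071, 1000, 583.3820)` gives about 1797.4 MB/s, against the reported 1797.5 MB/s for the generic kernel.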

Signed-off-by: Magnus Lundmark <[email protected]>

@jdemel left a comment

LGTM

I really appreciate that you added those intrinsics. They make the code easier to understand.

@jdemel merged commit 3cf2f53 into gnuradio:main on Jan 29, 2026
35 checks passed