new kernels for atan2 #636

Merged
merged 3 commits into from
Dec 1, 2023

Conversation

Contributor

@Ka-zam Ka-zam commented Oct 4, 2023

New kernels for atan2, based on the recently merged arctan work. Almost 40x speedup (generic: 5091.65 ms, a_avx2: 131.696 ms in the profile below).

With this PR:

$ volk_profile -R atan2

RUN_VOLK_TESTS: volk_32fc_s32f_atan2_32f(131071,1987)
generic completed in 5091.65 ms
polynomial completed in 2138.51 ms
a_avx2_fma completed in 131.781 ms
a_avx2 completed in 131.696 ms
u_avx2_fma completed in 131.963 ms
u_avx2 completed in 132.086 ms
Best aligned arch: a_avx2
Best unaligned arch: u_avx2_fma

Without:

$ volk_profile -R atan2

RUN_VOLK_TESTS: volk_32fc_s32f_atan2_32f(131071,1987)
a_sse4_1 completed in 5159.66 ms
a_sse completed in 5168.12 ms
generic completed in 5201.91 ms
Best aligned arch: a_sse4_1
Best unaligned arch: generic

Signed-off-by: Magnus Lundmark <[email protected]>
Signed-off-by: Magnus Lundmark <[email protected]>
Signed-off-by: Magnus Lundmark <[email protected]>
Contributor

@jdemel jdemel left a comment

Thanks for this PR. However, did you remove the SSE kernels? They should stay.

@Ka-zam
Contributor Author

Ka-zam commented Oct 13, 2023

There seems to be a dependency on LV_HAVE_LIB_SIMDMATH, which I've never heard of and can't find any information on. In my case the kernel simply compiles to the generic loop, even though my machine is definitely capable of SSE4_1.

It is not an SSE4_1 kernel in any case.

Perhaps something like this instead:

#if defined(LV_HAVE_SSE4_1) && defined(LV_HAVE_LIB_SIMDMATH)
#include <smmintrin.h>
#include <simdmath.h>

static inline void volk_32fc_s32f_atan2_32f_a_sse4_1_simdmath(float* outputVector,
.
.
.
#ifdef LV_HAVE_SSE4_1
#include <smmintrin.h>

#ifdef LV_HAVE_LIB_SIMDMATH
#include <simdmath.h>
#endif /* LV_HAVE_LIB_SIMDMATH */

static inline void volk_32fc_s32f_atan2_32f_a_sse4_1(float* outputVector,
                                                     const lv_32fc_t* complexVector,
                                                     const float normalizeFactor,
                                                     unsigned int num_points)
{
    const float* complexVectorPtr = (float*)complexVector;
    float* outPtr = outputVector;

    unsigned int number = 0;
    const float invNormalizeFactor = 1.0 / normalizeFactor;

#ifdef LV_HAVE_LIB_SIMDMATH                                           <--------------------------
    const unsigned int quarterPoints = num_points / 4;
    __m128 testVector = _mm_set_ps1(2 * M_PI);
    __m128 correctVector = _mm_set_ps1(M_PI);
    __m128 vNormalizeFactor = _mm_set_ps1(invNormalizeFactor);
    __m128 phase;
    __m128 complex1, complex2, iValue, qValue;
    __m128 keepMask;

    for (; number < quarterPoints; number++) {
        // Load IQ data:
        complex1 = _mm_load_ps(complexVectorPtr);
        complexVectorPtr += 4;
        complex2 = _mm_load_ps(complexVectorPtr);
        complexVectorPtr += 4;
        // Deinterleave IQ data:
        iValue = _mm_shuffle_ps(complex1, complex2, _MM_SHUFFLE(2, 0, 2, 0));
        qValue = _mm_shuffle_ps(complex1, complex2, _MM_SHUFFLE(3, 1, 3, 1));
        // Arctan to get phase:
        phase = atan2f4(qValue, iValue);
        // When Q = 0 and I < 0, atan2f4 returns 2pi instead of pi.
        // Compare to 2pi:
        keepMask = _mm_cmpneq_ps(phase, testVector);
        phase = _mm_blendv_ps(correctVector, phase, keepMask);
        // done with above correction.
        phase = _mm_mul_ps(phase, vNormalizeFactor);
        _mm_store_ps((float*)outPtr, phase);
        outPtr += 4;
    }
    number = quarterPoints * 4;
#endif /* LV_HAVE_LIB_SIMDMATH */                                <--------------------------

    for (; number < num_points; number++) {
        const float real = *complexVectorPtr++;
        const float imag = *complexVectorPtr++;
        *outPtr++ = atan2f(imag, real) * invNormalizeFactor;
    }
}
#endif /* LV_HAVE_SSE4_1 */

@jdemel
Contributor

jdemel commented Oct 14, 2023

My search for simdmath.h has led me nowhere so far.
It seems to have been introduced in this commit, or even earlier:
e4015c7
I'd argue it is reasonable to remove it.

Contributor

@jdemel jdemel left a comment

LGTM. Thanks for your contribution.

@jdemel jdemel merged commit 13dcc27 into gnuradio:main Dec 1, 2023
32 checks passed
@jj1bdx
Contributor

jj1bdx commented Dec 17, 2023

@Ka-zam and @jdemel
Please take a look at #730 and #731. Your reviews and comments are appreciated.

@argilo
Member

argilo commented Dec 17, 2023

> My search for simdmath.h has led me nowhere so far.

I think it might be IBM's "Software Development Kit for Multicore Acceleration". It has the same header name, and includes the functions that VOLK references (powf4, logf4, atan2f4, cosf4, sinf4).

http://ilab.usc.edu/packages/cell-processor/docs/CBE_SIMDmath_API_v2.1.pdf

It looks related to the Cell Broadband Engine, which some GNU Radio folks were working on around 2009:

https://www.researchgate.net/publication/241488194_High-Performance_SDR_GNU_Radio_and_the_IBM_Cell_Broadband_Engine

It probably makes sense to strip all the LV_HAVE_LIB_SIMDMATH code paths out of VOLK at this point. I'll open an issue for it.

Alesha72003 pushed a commit to Alesha72003/volk that referenced this pull request May 15, 2024

4 participants