Yes, what you propose would indeed reduce the latency by 1 clock cycle: from 5 for hadd to 4 = 1 (unpackhi) + 3 (add).
But in the end, the reduction code is not so important; the important part is the loop body.
In general, level 1 BLAS routines are not so important in what we do and can gain much less from optimization than level 2 and especially level 3 routines, so they have received less attention.
To me, the most important reason to implement your improvement would be to get rid of the dependency on SSE3 when targeting machines with capabilities only up to SSE2. I don't know if this is the case for you. The choice to target SSE3 (i.e. the Core microarchitecture) was meant as a reasonable trade-off between convenience and availability of ISAs, including on embedded devices, which usually lag a bit behind.
Sure, if you want to make the changes and open a PR, I would be happy to merge it. Otherwise I would leave it as it is for now; other things have higher priority on my side.
In the 'reduce' step of blasfeo_ddot, a horizontal add _mm_hadd_pd is computed. Instead, one could replace it with an unpackhi followed by a scalar add, effectively trading a packed double operation for a scalar one.
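The exact snippets from the issue did not survive on this page, so the following is only a minimal sketch of the idea, based on the "unpackhi + add" breakdown given in the reply. The helper name `reduce_sum2` is hypothetical and not part of BLASFEO. It also loads from memory for the sake of a self-contained example, whereas the real reduce step would presumably work on an accumulator already held in a register. Only SSE2 intrinsics are used, with a plain scalar fallback when SSE2 is not available.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_loadu_pd, _mm_unpackhi_pd, _mm_add_sd, _mm_cvtsd_f64 */
#endif

/* Hypothetical helper (not BLASFEO code): returns v[0] + v[1].
 * The SSE3 variant would be _mm_hadd_pd(acc, acc); this one needs SSE2 only. */
static double reduce_sum2(const double v[2])
{
#if defined(__SSE2__)
    __m128d acc = _mm_loadu_pd(v);            /* acc = [v0, v1]            */
    __m128d hi  = _mm_unpackhi_pd(acc, acc);  /* hi  = [v1, v1]            */
    __m128d sum = _mm_add_sd(acc, hi);        /* low lane = v0 + v1        */
    return _mm_cvtsd_f64(sum);                /* extract the low lane      */
#else
    return v[0] + v[1];                       /* portable scalar fallback  */
#endif
}
```

Calling `reduce_sum2` on a two-element array such as `{1.5, 2.25}` returns the sum of both lanes. Whether the one-cycle latency difference matters in practice depends on the microarchitecture; the more tangible effect of this form is that it removes the SSE3 requirement from the reduction.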