Accelerate single point evaluation for nmod_poly
#2492
Conversation
- previous version: makes tests more robust (more iterations; part of them focusing on large bitlengths)
- insert them in testing and profile
Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push … Can you make a comparison when using Clang? It does not use assembly for …
Well, I was initially surprised to see that unrolling helped here, on such simple loops. But then, actually, unrolling by hand did involve some logic: to avoid a dependency between consecutive loop iterations, I use the fourth power of the evaluation point.
Sure, it's interesting to know what this gives in any case. I'll have a look and report here.
Evaluation at 1 is just the sum of coefficients. Would it not be faster to … Or does the compiler generate good SIMD code for the conditional subtractions? Another idea that could be good for SIMD is to do the sum both in … The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.
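To illustrate the first point, here is a minimal sketch (not the FLINT implementation; `poly_eval_at_one` is a hypothetical name, and plain `uint64_t` arithmetic stands in for nmod operations) of evaluation at 1 as a running sum with one conditional subtraction per coefficient, the pattern whose SIMD codegen is being discussed:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: evaluation at 1 is the sum of the coefficients
   mod n.  For n < 2^63, the sum of two reduced residues fits in an
   unsigned 64-bit word, so one conditional subtraction per step keeps
   the accumulator reduced.  This is the loop a vectorizer would need
   to turn into masked SIMD subtractions. */
static uint64_t poly_eval_at_one(const uint64_t *coeffs, size_t len, uint64_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < len; i++)
    {
        s += coeffs[i];          /* s < 2n <= 2^64: no overflow for n < 2^63 */
        if (s >= n)              /* conditional subtraction keeps s < n */
            s -= n;
    }
    return s;
}
```

With AVX-512, the branch maps naturally onto a compare-into-mask followed by a masked subtract, which is presumably why clang has an easier time with it.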
Here is the assembly with gcc and clang. From a (too) quick inspection, I don't see a huge difference, but maybe you will. I've tried to run the profiling file with clang but ran into an issue.
Comparison with the version presently in main (which I guess is the target of your question about how compiler unrolling behaves) does not show much difference, as suggested by a quick review of the assembly.

Just to comment on the above-mentioned "dependency in iterations": in the 4-unrolled version, each loop iteration handles 4 coefficients with no dependency between the 4 computations. It roughly looks like … (where "+" and "*" are nmod operations and …). I'm not sure these avoided dependencies are the main explanation for the difference in speed, but this should at least help things go faster.
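The 4-unrolled scheme with independent accumulators can be sketched as follows (a hypothetical illustration, not the PR's code: `horner4` is an invented name, plain `%` arithmetic on a small modulus replaces nmod operations, and the length is assumed to be a multiple of 4 for brevity):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of 4-way unrolled Horner evaluation of
   p(x) = a[0] + a[1]*x + ... + a[len-1]*x^(len-1) at x = pt, mod n.
   Coefficients are split by index mod 4 into four interleaved Horner
   recurrences with stride pt4 = pt^4; the four updates in one loop
   iteration are independent of each other, which is the point of the
   unrolling.  Assumes len is a multiple of 4 and n small enough that
   products fit in 64 bits. */
static uint64_t horner4(const uint64_t *a, size_t len, uint64_t pt, uint64_t n)
{
    uint64_t pt2 = (pt * pt) % n;
    uint64_t pt3 = (pt2 * pt) % n;
    uint64_t pt4 = (pt3 * pt) % n;
    uint64_t r0 = 0, r1 = 0, r2 = 0, r3 = 0;

    for (size_t i = len; i >= 4; i -= 4)
    {
        /* four independent Horner steps, no dependency between them */
        r0 = (r0 * pt4 + a[i - 4]) % n;
        r1 = (r1 * pt4 + a[i - 3]) % n;
        r2 = (r2 * pt4 + a[i - 2]) % n;
        r3 = (r3 * pt4 + a[i - 1]) % n;
    }

    /* recombine: p(pt) = r0 + pt*r1 + pt^2*r2 + pt^3*r3 */
    return (r0 + r1 * pt % n + r2 * pt2 % n + r3 * pt3 % n) % n;
}
```

Within one iteration the four multiply-adds can issue in parallel, whereas plain Horner chains every step through the previous accumulator value.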
At least for 64-bit moduli, I suppose this will be better than what I wrote with lots of …
Well, I was hoping so (especially with AVX-512 available on Zen 4, which makes these a bit easier with masking, and a bit faster)... but now, looking at the generated assembly, while clang vectorizes, gcc does not. Both have comparable performance.
Thanks for the idea. I don't think I have encountered this before; any pointer to some place where this already arises? (Is the cost of conversions not expected to be a problem for such a fast operation, a sum of a vector?) In any case, I'll investigate the "1" and "-1" cases more. By the way, I guess at least the "1" case (the sum) should belong to …
This would mean first computing / storing the vector of powers of the evaluation point, is this what you suggest? If so, I thought this preliminary step would cancel any benefit of the fast dot product, but I may be wrong; I will have a look.
I don't know if it's actually fast in practice, just an idea to get around the lack of direct carry handling in SIMD.
Basically you could use rectangular or modular splitting, so you'd compute O(sqrt(n)) powers at evaluation time and end up with O(sqrt(n)) modular reductions (though the optimal parameter probably will not look exactly like O(sqrt(n))).
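The rectangular-splitting idea can be sketched like this (a hypothetical illustration: `eval_rectangular` is an invented name, and plain `%` arithmetic on a small modulus stands in for the dot products with delayed reduction that a real implementation would use):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of rectangular splitting for evaluating
   p(x) = a[0] + ... + a[len-1]*x^(len-1) at x = pt, mod n:
   precompute m powers pt^0..pt^(m-1) (m chosen near sqrt(len)),
   reduce each block of m coefficients to a dot product against those
   powers, then combine the per-block values with a short Horner loop
   in pt^m.  In a real implementation each dot product would accumulate
   in double words and need only one modular reduction; here every
   product is reduced with % to keep the sketch simple.
   Assumes 1 <= m <= 64 and n small enough that products fit in 64 bits. */
static uint64_t eval_rectangular(const uint64_t *a, size_t len,
                                 uint64_t pt, uint64_t n, size_t m)
{
    uint64_t pw[64];                    /* pw[j] = pt^j mod n */
    pw[0] = 1 % n;
    for (size_t j = 1; j < m; j++)
        pw[j] = (pw[j - 1] * pt) % n;
    uint64_t ptm = (pw[m - 1] * pt) % n;   /* pt^m, the outer stride */

    size_t blocks = (len + m - 1) / m;
    uint64_t res = 0;
    for (size_t k = blocks; k-- > 0; )
    {
        uint64_t dot = 0;   /* dot product of block k with the powers */
        for (size_t j = 0; j < m && k * m + j < len; j++)
            dot = (dot + a[k * m + j] * pw[j]) % n;
        res = (res * ptm + dot) % n;   /* Horner step in pt^m */
    }
    return res;
}
```

With m ~ sqrt(len) this does about sqrt(len) power computations and sqrt(len) Horner reductions, everything else being dot-product work.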
Here are some enhancements of evaluation at an `nmod` point for `nmod_poly`. I don't have specific uses for this; it was done as a "warmup" for writing more efficient implementations of reduction modulo polynomials `x^n - c` for `n >= 1` (draft started at #2470, itself useful for the in-progress FFT #2107). But since this seems to accelerate the existing code in all cases, it might as well be merged in(?).

See the first table below: for each modulus bitsize, the first column measures the old version, the second column the new one. (And some time ago, we only had the very first column, for all moduli...)

… `+1` and `-1`.