
@vneiger (Collaborator) commented Nov 15, 2025

Here are some enhancements to evaluation at an nmod point for nmod_poly. I don't have specific uses for this; it was done as a "warmup" for writing more efficient implementations of reduction modulo polynomials x^n - c for n >= 1 (draft started at #2470, itself useful for the in-progress FFT #2107). But since this seems to accelerate the existing code in all cases, it might as well be merged in(?).

  1. Acceleration of existing cases (mainly through unrolling loops):
  • polynomials of small length (up to about 10) are unaffected
  • speed-up becomes substantial for length 32 (factor 2)
  • speed-up between 2.5 and 3 for large lengths

See first table below: for each modulus bitsize, the first column measures the old version, the second column measures the new one. (And some time ago, we only had the very first column, for all moduli...)

  2. Adding specific functions for evaluation at +1 and -1:
  • the main evaluation function detects these cases and chooses the relevant function depending on the bitsize of the modulus
  • independently of the modulus bitsize, the speed-up is consistently a factor of about 4 versus the best general (not specific to +1 or -1) variant we have at hand for that bitsize (see second table below).
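As a sketch of what these two special cases compute (hypothetical Python with arbitrary-precision ints, not the FLINT code): evaluation at +1 is the plain sum of the coefficients reduced mod the modulus, and evaluation at -1 the alternating sum.

```python
def eval_at_one(poly, mod):
    # P(1) is the sum of the coefficients, reduced mod `mod`
    return sum(poly) % mod

def eval_at_minus_one(poly, mod):
    # P(-1) is the alternating sum of the coefficients
    return sum(c if i % 2 == 0 else -c for i, c in enumerate(poly)) % mod
```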
Intel(R) Core(TM) Ultra 7 165H
length |    64 bits    |    63 bits    |   <= 62 bits
1      | 6.71    8.57  | 13.15   13.18 | 13.15   13.17
2      | 4.52    4.68  | 3.42    4.04  | 3.37    3.58
3      | 4.04    4.18  | 2.98    3.63  | 2.47    2.61
4      | 4.91    4.84  | 3.01    3.25  | 2.33    2.34
6      | 5.93    5.90  | 3.39    3.49  | 2.17    2.19
8      | 6.61    6.65  | 3.73    3.84  | 2.10    2.12
10     | 7.08    7.15  | 4.05    4.12  | 2.04    2.10
12     | 8.02    6.14  | 4.32    4.21  | 2.09    2.16
16     | 9.22    5.71  | 4.71    3.88  | 2.28    3.07
20     | 9.95    5.46  | 5.34    3.57  | 2.45    2.88
32     | 11.06   5.09  | 6.34    3.29  | 2.81    2.47
45     | 11.59   4.91  | 6.77    3.13  | 3.46    2.30
64     | 12.02   4.75  | 7.14    2.96  | 3.95    2.15
128    | 12.46   4.59  | 7.59    2.85  | 4.51    2.02
256    | 12.78   4.52  | 7.80    2.73  | 4.87    1.92
1024   | 12.90   4.41  | 7.92    2.64  | 5.09    1.87
8192   | 12.93   4.41  | 8.06    2.65  | 5.16    1.86
65536  | 12.99   4.39  | 7.98    2.63  | 5.19    1.86
200000 | 13.03   4.42  | 7.97    2.65  | 5.20    1.86
1000000| 13.04   4.48  | 8.01    2.69  | 5.22    1.92
AMD Ryzen 7 PRO 7840U
nbits = 62
length   generic precomp lazy    one     mone
1        8.07    7.51    7.69    6.11    6.23
2        6.34    4.39    4.18    4.72    5.06
3        6.38    4.36    3.40    3.15    3.61
4        6.75    4.41    3.25    2.55    2.95
6        7.76    5.28    3.92    2.23    1.90
8        8.86    5.44    3.71    1.71    1.88
10       9.96    5.74    3.70    1.45    1.53
12       7.87    5.39    3.55    1.20    1.32
16       7.02    5.03    4.13    1.27    1.32
20       6.76    4.78    3.73    1.05    1.08
32       6.37    4.47    3.24    0.88    0.92
45       6.00    4.14    2.79    0.73    0.75
64       5.66    3.93    2.58    0.75    0.74
128      5.46    3.72    2.31    0.64    0.66
256      5.34    3.62    2.21    0.60    0.59
1024     5.21    3.55    2.08    0.56    0.55
8192     5.21    3.54    2.06    0.57    0.56
65536    5.32    3.55    2.15    0.58    0.55
200000   5.44    3.55    2.08    0.57    0.55
1000000  5.24    3.56    2.07    0.58    0.55

nbits = 63
length   generic precomp one     mone
1        8.18    7.94    6.36    6.38
2        6.38    4.46    5.09    4.78
3        6.36    4.29    3.63    3.64
4        6.61    4.56    3.00    3.53
6        8.13    5.42    2.17    2.25
8        9.00    5.49    2.01    2.03
10       9.73    5.53    1.77    1.79
12       8.04    5.38    1.62    1.62
16       7.13    5.28    1.44    1.43
20       6.82    4.73    1.29    1.30
32       6.20    4.35    1.10    1.11
45       6.04    4.17    1.02    1.07
64       5.67    3.95    1.02    1.00
128      5.62    3.70    0.91    0.93
256      5.33    3.75    0.88    0.88
1024     5.26    3.55    0.86    0.86
8192     5.23    3.66    0.88    0.87
65536    5.25    3.52    0.86    0.85
200000   5.29    3.55    0.85    0.85
1000000  5.27    3.55    0.90    0.87

nbits = 64
length   generic one     mone
1        8.38    6.40    6.33
2        6.45    5.42    5.15
3        6.37    3.40    3.67
4        6.59    2.67    3.08
6        7.89    2.54    2.35
8        8.96    2.18    2.26
10       9.76    2.11    2.03
12       7.91    1.85    1.92
16       7.18    1.70    1.74
20       6.83    1.63    1.65
32       6.19    1.46    1.48
45       6.07    1.42    1.40
64       5.88    1.35    1.35
128      5.50    1.27    1.28
256      5.35    1.24    1.24
1024     5.26    1.21    1.21
8192     5.39    1.24    1.25
65536    5.23    1.22    1.23
200000   5.25    1.22    1.22
1000000  5.26    1.25    1.24

@albinahlback (Collaborator) commented

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

@vneiger (Collaborator, Author) commented Nov 15, 2025

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Well, I was initially surprised to see that unrolling helped here, on such simple loops. But then, actually unrolling by hand did involve some logic: to avoid a dependency in consecutive loop iterations, I use the fourth power c^4 of the point c at which we evaluate. I guess the compiler will not introduce this kind of logic --- especially since this is an nmod power, not just c*c*c*c, and I suspect that there is maybe no plain, simple unrolling that would be efficient...? (I'm actually hoping that the explanation is not this one, because this is not the case anymore in the more general remainder modulo x^n - c: as soon as n >= 4 or so, you can unroll more trivially without using powers of c.)

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Sure, it's interesting to know what this gives in any case. I'll have a look and report here.

@fredrik-johansson (Collaborator) commented Nov 16, 2025

Evaluation at 1 is just the sum of coefficients. Would it not be faster to add_ssaaaa up everything and then reduce?
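A hypothetical Python sketch of that strategy, simulating the chained double-word additions that add_ssaaaa calls would perform, with a single reduction at the end (illustrative only, not the FLINT code):

```python
MASK64 = (1 << 64) - 1

def sum_mod_doubleword(coeffs, mod):
    # accumulate the coefficient sum in a simulated 128-bit (hi, lo)
    # counter, as chained add_ssaaaa calls would, then reduce only once
    hi = lo = 0
    for c in coeffs:
        s = lo + c
        hi = (hi + (s >> 64)) & MASK64   # propagate the carry into hi
        lo = s & MASK64
    return ((hi << 64) | lo) % mod
```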

Or does the compiler generate good SIMD code for the conditional subtractions?

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.
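One possible reading of this idea, as a hypothetical Python sketch: the wrapped 64-bit sum gives the exact low word, the parallel double sum approximates the full total, and rounding recovers the lost multiple of 2^64 (only valid while the accumulated float error stays below 2^63):

```python
def sum_ulong_double(coeffs):
    lo = 0      # exact sum modulo 2^64 (the "ulong" lane)
    d = 0.0     # approximate full sum (the "double" lane)
    for c in coeffs:
        lo = (lo + c) & ((1 << 64) - 1)
        d += float(c)
    # lo is exact; the double recovers which multiple of 2^64 was lost
    hi = round((d - lo) / 2.0 ** 64)
    return hi * (1 << 64) + lo
```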

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.

@vneiger (Collaborator, Author) commented Nov 16, 2025

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Here is the assembly with gcc and clang. From a (too) quick inspection, I don't see a huge difference, but maybe you will.
ev_nmod.zip
Also, not visible in the files: -funroll-loops for gcc does basically nothing for the manually unrolled version of the function, and does not do much for the "natural" version either (it changes a small thing, but this does not look like unrolling of the main loop).

I've tried to run the profile file with clang but ran into an issue: prof_repeat seems to never finish, just increasing the count variable indefinitely. I tried making sure the content of the profiled part is not optimized away by the compiler or something like this, but this did not help. I thought I had already used clang and prof_repeat before without this issue. Any quick hint at what I might be doing wrong? Otherwise I'll open an issue.

@vneiger (Collaborator, Author) commented Nov 19, 2025

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

A comparison on the version presently in main (which I guess is the target of your question about how compiler unrolling behaves) does not show much difference, as the quick look at the assembly suggested:

        GCC                    CLANG
4.82    4.26    4.06  |  4.89    4.33    4.23
5.24    3.56    2.82  |  5.45    3.89    2.92
6.14    4.07    2.92  |  6.74    4.65    3.11
7.52    4.96    3.07  |  8.19    5.80    3.36
8.58    5.05    3.32  |  9.06    6.47    3.72
9.60    5.32    3.29  |  10.36   5.79    3.69
10.60   5.60    3.30  |  11.17   6.15    3.93
11.75   6.43    3.85  |  12.91   6.79    4.36
12.36   6.78    4.20  |  13.51   7.51    4.65
13.41   7.45    4.71  |  14.56   8.26    5.31
13.73   7.70    4.93  |  15.27   8.81    5.54
14.22   7.85    5.10  |  15.82   8.93    5.78
14.52   8.22    5.34  |  16.27   9.21    6.00
14.73   8.34    5.43  |  16.28   9.48    6.11
14.81   8.43    5.52  |  16.64   9.62    6.20
14.93   8.44    5.71  |  16.66   9.69    6.23
14.88   8.49    5.50  |  16.74   9.72    6.21

Just to comment on the above-mentioned "dependency in iterations": in the 4-unrolled version, each loop iteration handles 4 coefficients with no dependency between the 4 computations. It roughly looks like

a0 = stuff; a1 = stuff; a2 = stuff; a3 = stuff
for k ...  k += 4
    a0 = poly[k+0] + c4 * a0
    a1 = poly[k+1] + c4 * a1
    a2 = poly[k+2] + c4 * a2
    a3 = poly[k+3] + c4 * a3
a = a0 + a1*c + a2*c**2 + a3*c**3

(where "+" and "*" are nmod operations, and c4 = c**4 is computed once and for all). Whereas a "direct" unrolling would create a dependency at each step of the iteration body, such as

a = stuff
for k ...  k += 4
    a = poly[k+0] + c * a
    a = poly[k+1] + c * a
    a = poly[k+2] + c * a
    a = poly[k+3] + c * a

I'm not sure these avoided dependencies are the main explanation for the speed difference, but avoiding them should at least help things go faster.
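The unrolled scheme above can be written out as a runnable sketch (plain Python ints standing in for nmod operations; illustrative, not the FLINT code), assuming for simplicity that the length is a multiple of 4:

```python
def eval_unrolled4(poly, c, mod):
    # evaluate sum(poly[i] * c^i) mod `mod` with four independent
    # Horner accumulators over c4 = c^4, so the four updates in each
    # iteration carry no dependency between them
    assert len(poly) % 4 == 0
    c4 = pow(c, 4, mod)
    a = [0, 0, 0, 0]
    for k in range(len(poly) - 4, -1, -4):   # top block down to 0
        for j in range(4):
            a[j] = (poly[k + j] + c4 * a[j]) % mod
    # recombine the four residue classes: a0 + a1*c + a2*c^2 + a3*c^3
    return (a[0] + a[1] * c + a[2] * c * c + a[3] * c * c * c) % mod
```

Each accumulator a[j] ends up holding the sub-polynomial of coefficients with index congruent to j mod 4, evaluated at c^4; the final recombination weights them by c^j.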

@vneiger (Collaborator, Author) commented Nov 19, 2025

Evaluation at 1 is just the sum of coefficients. Would it not be faster to add_ssaaaa up everything and then reduce?

At least for 64-bit moduli I suppose this will be better than what I wrote with lots of n_addmod.

Or does the compiler generate good SIMD code for the conditional subtractions?

Well, I was hoping it would (especially with avx512 available on zen4, which makes these a bit easier with masking and a bit faster)... but now, looking at the generated assembly, I see that clang vectorizes while gcc does not. Both have comparable performance nonetheless.
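For reference, the conditional subtraction in question can be expressed branchlessly with a mask, which is the form SIMD masking mimics; a hypothetical Python sketch of the pattern (not FLINT's actual n_addmod):

```python
def n_addmod(a, b, mod):
    # a, b assumed already reduced in [0, mod); add, then mask out one
    # subtraction of mod instead of branching on the comparison
    s = a + b
    return s - (mod & -(s >= mod))   # mask is all-ones iff s >= mod
```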

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.

Thanks for the idea. I do not think I have encountered this before; any pointer to some place where it already arises? (Is the cost of conversions not expected to be a problem for such a fast operation as summing a vector?)

In any case, I'll investigate the "1" and "-1" cases more. By the way, I guess at least the "1" case (the sum) should belong to nmod_vec.

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.

This would mean first computing / storing the vector of powers of the evaluation point, is this what you suggest? If so, I thought this preliminary step would cancel out any benefit of the fast dot product, but I may be wrong; I will have a look.

@fredrik-johansson (Collaborator) commented

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.

Thanks for the idea. I do not think I have encountered this before; any pointer to some place where it already arises? (Is the cost of conversions not expected to be a problem for such a fast operation as summing a vector?)

I don't know if it's actually fast in practice; it's just an idea to get around the lack of direct carry handling in SIMD.

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.

This would mean first computing / storing the vector of powers of the evaluation point, is this what you suggest? If so, I thought this preliminary step would cancel out any benefit of the fast dot product, but I may be wrong; I will have a look.

Basically you could use rectangular or modular splitting: you'd compute O(sqrt(n)) powers at evaluation time and end up with O(sqrt(n)) modular reductions (though the optimal block size will probably not look quite like O(sqrt(n))).
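A hypothetical sketch of such a rectangular splitting (Python ints in place of the nmod dot product, block size naively set to the square root of the length):

```python
import math

def eval_rectangular(poly, c, mod):
    # block size m ~ sqrt(length): O(sqrt) powers precomputed, and a
    # single modular reduction per block instead of one per coefficient
    m = max(1, math.isqrt(len(poly)))
    powers = [pow(c, i, mod) for i in range(m)]
    cm = pow(c, m, mod)                      # step between blocks
    acc = 0
    for k in range(((len(poly) - 1) // m) * m, -1, -m):  # top block down
        dot = sum(b * p for b, p in zip(poly[k:k + m], powers))
        acc = (acc * cm + dot) % mod         # one reduction per block
    return acc
```

Each block is a dot product against [1, c, ..., c^(m-1)] with delayed reduction, and the blocks are combined by Horner's rule in c^m.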
