
@vneiger (Collaborator) commented Nov 15, 2025

Here are some enhancements to evaluation at an nmod point for nmod_poly. I don't have specific uses for this; it was done as a "warmup" for writing more efficient implementations of reduction modulo polynomials x^n - c for n >= 1 (draft started at #2470, itself useful for the in-progress FFT #2107). But since this seems to accelerate the existing code in all cases, it might as well be merged in(?).

  1. Acceleration of existing cases (mainly through unrolling loops):
  • polynomials of small length (up to about 10) are unaffected
  • speed-up becomes substantial for length 32 (factor 2)
  • speed-up between 2.5 and 3 for large lengths

See first table below: for each modulus bitsize, the first column measures the old version, the second column measures the new one. (And some time ago, we only had the very first column, for all moduli...)

  2. Adding specific functions for evaluation at +1 and -1:
  • the main evaluation function detects these cases and chooses the relevant function depending on the bitsize of the modulus
  • independently of the modulus bitsize, the speed-up is consistently a factor of about 4 versus the best general (not specific to +1 or -1) variant we have at hand for that bitsize (see second table below).
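As a sketch of what these two special cases compute (hypothetical Python with arbitrary-precision ints, not the FLINT code): evaluation at +1 is the plain sum of the coefficients reduced mod the modulus, and evaluation at -1 the alternating sum.

```python
def eval_at_one(poly, mod):
    # P(1) is the sum of the coefficients, reduced mod `mod`
    return sum(poly) % mod

def eval_at_minus_one(poly, mod):
    # P(-1) is the alternating sum of the coefficients
    return sum(c if i % 2 == 0 else -c for i, c in enumerate(poly)) % mod
```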
Intel(R) Core(TM) Ultra 7 165H
length |    64 bits    |    63 bits    |   <= 62 bits
1      | 6.71    8.57  | 13.15   13.18 | 13.15   13.17
2      | 4.52    4.68  | 3.42    4.04  | 3.37    3.58
3      | 4.04    4.18  | 2.98    3.63  | 2.47    2.61
4      | 4.91    4.84  | 3.01    3.25  | 2.33    2.34
6      | 5.93    5.90  | 3.39    3.49  | 2.17    2.19
8      | 6.61    6.65  | 3.73    3.84  | 2.10    2.12
10     | 7.08    7.15  | 4.05    4.12  | 2.04    2.10
12     | 8.02    6.14  | 4.32    4.21  | 2.09    2.16
16     | 9.22    5.71  | 4.71    3.88  | 2.28    3.07
20     | 9.95    5.46  | 5.34    3.57  | 2.45    2.88
32     | 11.06   5.09  | 6.34    3.29  | 2.81    2.47
45     | 11.59   4.91  | 6.77    3.13  | 3.46    2.30
64     | 12.02   4.75  | 7.14    2.96  | 3.95    2.15
128    | 12.46   4.59  | 7.59    2.85  | 4.51    2.02
256    | 12.78   4.52  | 7.80    2.73  | 4.87    1.92
1024   | 12.90   4.41  | 7.92    2.64  | 5.09    1.87
8192   | 12.93   4.41  | 8.06    2.65  | 5.16    1.86
65536  | 12.99   4.39  | 7.98    2.63  | 5.19    1.86
200000 | 13.03   4.42  | 7.97    2.65  | 5.20    1.86
1000000| 13.04   4.48  | 8.01    2.69  | 5.22    1.92
AMD Ryzen 7 PRO 7840U
nbits = 62
length   generic precomp lazy    one     mone
1        8.07    7.51    7.69    6.11    6.23
2        6.34    4.39    4.18    4.72    5.06
3        6.38    4.36    3.40    3.15    3.61
4        6.75    4.41    3.25    2.55    2.95
6        7.76    5.28    3.92    2.23    1.90
8        8.86    5.44    3.71    1.71    1.88
10       9.96    5.74    3.70    1.45    1.53
12       7.87    5.39    3.55    1.20    1.32
16       7.02    5.03    4.13    1.27    1.32
20       6.76    4.78    3.73    1.05    1.08
32       6.37    4.47    3.24    0.88    0.92
45       6.00    4.14    2.79    0.73    0.75
64       5.66    3.93    2.58    0.75    0.74
128      5.46    3.72    2.31    0.64    0.66
256      5.34    3.62    2.21    0.60    0.59
1024     5.21    3.55    2.08    0.56    0.55
8192     5.21    3.54    2.06    0.57    0.56
65536    5.32    3.55    2.15    0.58    0.55
200000   5.44    3.55    2.08    0.57    0.55
1000000  5.24    3.56    2.07    0.58    0.55

nbits = 63
length   generic precomp one     mone
1        8.18    7.94    6.36    6.38
2        6.38    4.46    5.09    4.78
3        6.36    4.29    3.63    3.64
4        6.61    4.56    3.00    3.53
6        8.13    5.42    2.17    2.25
8        9.00    5.49    2.01    2.03
10       9.73    5.53    1.77    1.79
12       8.04    5.38    1.62    1.62
16       7.13    5.28    1.44    1.43
20       6.82    4.73    1.29    1.30
32       6.20    4.35    1.10    1.11
45       6.04    4.17    1.02    1.07
64       5.67    3.95    1.02    1.00
128      5.62    3.70    0.91    0.93
256      5.33    3.75    0.88    0.88
1024     5.26    3.55    0.86    0.86
8192     5.23    3.66    0.88    0.87
65536    5.25    3.52    0.86    0.85
200000   5.29    3.55    0.85    0.85
1000000  5.27    3.55    0.90    0.87

nbits = 64
length   generic one     mone
1        8.38    6.40    6.33
2        6.45    5.42    5.15
3        6.37    3.40    3.67
4        6.59    2.67    3.08
6        7.89    2.54    2.35
8        8.96    2.18    2.26
10       9.76    2.11    2.03
12       7.91    1.85    1.92
16       7.18    1.70    1.74
20       6.83    1.63    1.65
32       6.19    1.46    1.48
45       6.07    1.42    1.40
64       5.88    1.35    1.35
128      5.50    1.27    1.28
256      5.35    1.24    1.24
1024     5.26    1.21    1.21
8192     5.39    1.24    1.25
65536    5.23    1.22    1.23
200000   5.25    1.22    1.22
1000000  5.26    1.25    1.24

@albinahlback (Collaborator) commented

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

@vneiger (Collaborator, Author) commented Nov 15, 2025

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Well, I was initially surprised to see that unrolling helped here, on such simple loops. But then, actually unrolling by hand did involve some logic: to avoid a dependency in consecutive loop iterations, I use the fourth power c^4 of the point c at which we evaluate. I guess the compiler will not introduce this kind of logic --- especially since this is an nmod power, not just c*c*c*c, and I suspect that there is maybe no plain, simple unrolling that would be efficient...? (I'm actually hoping that the explanation is not this one, because this is not the case anymore in the more general remainder modulo x^n - c: as soon as n >= 4 or so, you can unroll more trivially without using powers of c.)

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Sure, it's interesting to know what this gives in any case. I'll have a look and report here.

@fredrik-johansson (Collaborator) commented Nov 16, 2025

Evaluation at 1 is just the sum of coefficients. Would it not be faster to add_ssaaaa up everything and then reduce?
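A hypothetical Python sketch of that strategy, simulating the chained double-word additions that add_ssaaaa calls would perform, with a single reduction at the end (illustrative only, not the FLINT code):

```python
MASK64 = (1 << 64) - 1

def sum_mod_doubleword(coeffs, mod):
    # accumulate the coefficient sum in a simulated 128-bit (hi, lo)
    # counter, as chained add_ssaaaa calls would, then reduce only once
    hi = lo = 0
    for c in coeffs:
        s = lo + c
        hi = (hi + (s >> 64)) & MASK64   # propagate the carry into hi
        lo = s & MASK64
    return ((hi << 64) | lo) % mod
```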

Or does the compiler generate good SIMD code for the conditional subtractions?

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.
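One possible reading of this idea, as a hypothetical Python sketch: the wrapped 64-bit sum gives the exact low word, the parallel double sum approximates the full total, and rounding recovers the lost multiple of 2^64 (only valid while the accumulated float error stays below 2^63):

```python
def sum_ulong_double(coeffs):
    lo = 0      # exact sum modulo 2^64 (the "ulong" lane)
    d = 0.0     # approximate full sum (the "double" lane)
    for c in coeffs:
        lo = (lo + c) & ((1 << 64) - 1)
        d += float(c)
    # lo is exact; the double recovers which multiple of 2^64 was lost
    hi = round((d - lo) / 2.0 ** 64)
    return hi * (1 << 64) + lo
```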

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.

@vneiger (Collaborator, Author) commented Nov 16, 2025

Unrolling is useful here because compilers cannot unroll well whenever there is assembly in the loop? Because we push -funroll-loops for nmod_poly.

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

Here is the assembly with gcc and clang. From a (too) quick inspection, I don't see a huge difference, but maybe you will.
ev_nmod.zip
Also, not visible in the files: -funroll-loops for gcc does basically nothing for the manually unrolled version of the function, and does not do much for the "natural" version either (it changes a small thing, but this does not look like unrolling of the main loop).

I've tried to run the profile file with clang but ran into an issue: prof_repeat seems to never finish, just increasing the count variable indefinitely. I tried making sure the content of the profiled part is not optimized away by the compiler or something like this, but this did not help. I thought I had already used clang and prof_repeat before without this issue. Any quick hint at what I might be doing wrong? Otherwise I'll open an issue.

@vneiger (Collaborator, Author) commented Nov 19, 2025

Can you make a comparison when using Clang? It does not use assembly for umul_ppmm.

A comparison on the version presently in main (which I guess is the target of your question about how compiler unrolling behaves) does not show much difference, as the quick look at the assembly suggested:

        GCC                    CLANG
4.82    4.26    4.06  |  4.89    4.33    4.23
5.24    3.56    2.82  |  5.45    3.89    2.92
6.14    4.07    2.92  |  6.74    4.65    3.11
7.52    4.96    3.07  |  8.19    5.80    3.36
8.58    5.05    3.32  |  9.06    6.47    3.72
9.60    5.32    3.29  |  10.36   5.79    3.69
10.60   5.60    3.30  |  11.17   6.15    3.93
11.75   6.43    3.85  |  12.91   6.79    4.36
12.36   6.78    4.20  |  13.51   7.51    4.65
13.41   7.45    4.71  |  14.56   8.26    5.31
13.73   7.70    4.93  |  15.27   8.81    5.54
14.22   7.85    5.10  |  15.82   8.93    5.78
14.52   8.22    5.34  |  16.27   9.21    6.00
14.73   8.34    5.43  |  16.28   9.48    6.11
14.81   8.43    5.52  |  16.64   9.62    6.20
14.93   8.44    5.71  |  16.66   9.69    6.23
14.88   8.49    5.50  |  16.74   9.72    6.21

Just to comment on the above-mentioned "dependency in iterations": in the 4-unrolled version, each loop iteration handles 4 coefficients with no dependency between the 4 computations. It roughly looks like

a0 = stuff; a1 = stuff; a2 = stuff; a3 = stuff
for k ...  k += 4
    a0 = poly[k+0] + c4 * a0
    a1 = poly[k+1] + c4 * a1
    a2 = poly[k+2] + c4 * a2
    a3 = poly[k+3] + c4 * a3
a = a0 + a1*c + a2*c**2 + a3*c**3

(where "+" and "*" are nmod operations, and c4 = c**4 is computed once and for all). Whereas a "direct" unrolling would create a dependency at each step of the iteration body, such as

a = stuff
for k ...  k += 4
    a = poly[k+0] + c * a
    a = poly[k+1] + c * a
    a = poly[k+2] + c * a
    a = poly[k+3] + c * a

I'm not sure these avoided dependencies are the main explanation for the speed difference, but avoiding them should at least help things go faster.
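The unrolled scheme above can be written out as a runnable sketch (plain Python ints standing in for nmod operations; illustrative, not the FLINT code), assuming for simplicity that the length is a multiple of 4:

```python
def eval_unrolled4(poly, c, mod):
    # evaluate sum(poly[i] * c^i) mod `mod` with four independent
    # Horner accumulators over c4 = c^4, so the four updates in each
    # iteration carry no dependency between them
    assert len(poly) % 4 == 0
    c4 = pow(c, 4, mod)
    a = [0, 0, 0, 0]
    for k in range(len(poly) - 4, -1, -4):   # top block down to 0
        for j in range(4):
            a[j] = (poly[k + j] + c4 * a[j]) % mod
    # recombine the four residue classes: a0 + a1*c + a2*c^2 + a3*c^3
    return (a[0] + a[1] * c + a[2] * c * c + a[3] * c * c * c) % mod
```

Each accumulator a[j] ends up holding the sub-polynomial of coefficients with index congruent to j mod 4, evaluated at c^4; the final recombination weights them by c^j.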

@vneiger (Collaborator, Author) commented Nov 19, 2025

Evaluation at 1 is just the sum of coefficients. Would it not be faster to add_ssaaaa up everything and then reduce?

At least for 64-bit moduli I suppose this will be better than what I wrote with lots of n_addmod.

Or does the compiler generate good SIMD code for the conditional subtractions?

Well, I was hoping it would (especially with avx512 available on zen4, which makes these a bit easier with masking and a bit faster)... but now, looking at the generated assembly, I see that clang vectorizes while gcc does not. Both have comparable performance nonetheless.
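For reference, the conditional subtraction in question can be expressed branchlessly with a mask, which is the form SIMD masking mimics; a hypothetical Python sketch of the pattern (not FLINT's actual n_addmod):

```python
def n_addmod(a, b, mod):
    # a, b assumed already reduced in [0, mod); add, then mask out one
    # subtraction of mod instead of branching on the comparison
    s = a + b
    return s - (mod & -(s >= mod))   # mask is all-ones iff s >= mod
```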

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.

Thanks for the idea. I do not think I have encountered this before; any pointer to some place where it already arises? (Is the cost of conversions not expected to be a problem for such a fast operation as summing a vector?)

In any case, I'll investigate the "1" and "-1" cases more. By the way, I guess at least the "1" case (the sum) should belong to nmod_vec.

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.

This would mean first computing / storing the vector of powers of the evaluation point, is this what you suggest? If so, I thought this preliminary step would cancel out any benefit of the fast dot product, but I may be wrong; I will have a look.

@fredrik-johansson (Collaborator) commented

Another idea that could be good for SIMD is to do the sum both in ulong (i.e. mod 2^64) and double simultaneously and finally combine the low and high bits.

Thanks for the idea. I do not think I have encountered this before; any pointer to some place where it already arises? (Is the cost of conversions not expected to be a problem for such a fast operation as summing a vector?)

I don't know if it's actually fast in practice; it's just an idea to get around the lack of direct carry handling in SIMD.

The general case can also be reduced asymptotically to dot products with a negligible number of modular reductions.

This would mean first computing / storing the vector of powers of the evaluation point, is this what you suggest? If so, I thought this preliminary step would cancel out any benefit of the fast dot product, but I may be wrong; I will have a look.

Basically you could use rectangular or modular splitting: you'd compute O(sqrt(n)) powers at evaluation time and end up with O(sqrt(n)) modular reductions (though the optimal block size will probably not look quite like O(sqrt(n))).
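A hypothetical sketch of such a rectangular splitting (Python ints in place of the nmod dot product, block size naively set to the square root of the length):

```python
import math

def eval_rectangular(poly, c, mod):
    # block size m ~ sqrt(length): O(sqrt) powers precomputed, and a
    # single modular reduction per block instead of one per coefficient
    m = max(1, math.isqrt(len(poly)))
    powers = [pow(c, i, mod) for i in range(m)]
    cm = pow(c, m, mod)                      # step between blocks
    acc = 0
    for k in range(((len(poly) - 1) // m) * m, -1, -m):  # top block down
        dot = sum(b * p for b, p in zip(poly[k:k + m], powers))
        acc = (acc * cm + dot) % mod         # one reduction per block
    return acc
```

Each block is a dot product against [1, c, ..., c^(m-1)] with delayed reduction, and the blocks are combined by Horner's rule in c^m.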
