-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Faster NMOD_RED using different precomputation? #2061
Comments
Impressive! See also #1950, #1926, #1823. What we want ultimately is to introduce an n_mod context object that is passed by reference rather than by value and which is not stored inside matrices and polynomials. This will give us more flexibility to store additional useful precomputed data like this. Can you also speed up reduction from 2 or 3 limbs building on the same trick? |
Sounds good indeed!
It is usable for reducing from more limbs at least with naive approaches (*), but I'm not sure at the moment how efficient that would be. And this may restrict (*) e.g. precompute not only |
Some attempts at 2 and 3 limbs are rather promising (assuming I compare everything properly, this was done a bit quickly). This is cycles/limb, up to some constant factor, for reducing a length-1000 array. First 4 columns: reducing each entry of the array modulo
I'll have a look at the many-limb case when possible. |
NMOD_RED
reduces a single limb modulo some givenn
. It currently callsNMOD_RED2
which reduces a two-limb(a_hi, a_lo)
integer modn
, wherea_hi
must be< n
(when called forNMOD_RED
, it is zero).Another way to reduce a single limb
a
modn
is to use the approach ofn_mulmod_shoup
to compute1*a mod n
, so with the fixed operand 1:q = floor(2**64 / n)
(which is the output ofn_mulmod_precomp_shoup
with input1
)n_mulmod_shoup(1L, a, q, n)
The precomputation is... a precomputation. It only depends on
n
, not ona
.The call to
n_mulmod_shoup
simply does oneumul_ppmm
to get somep_hi
which is eitherfloor(a/n)
orfloor(a/n)-1
, then one single-word multiplicationres = a - p_hi * n
, and finally a single correction step (ifres >= n
, then returnres - n
else returnres
).In this particular case, this actually works even when
n
has 64 bits, there is absolutely no restriction onn
ora
(usuallyn_mulmod_shoup
requiresn
to have at most 63 bits).NMOD_RED2
has an additionaladd_ssaaaa
, some shifts here and there, and two correction steps.Using the
mulmod_shoup
approach seems beneficial, for example when runningnmod_vec/profile/p-reduce
on zen4 I get the following values (left is existing code, right is usingmulmod_shoup
as showed below):(2.4 goes down to 2.0 if explicitly unrolling the loop; unrolling does not seem to help the
NMOD_RED
variant)These timings were obtained by simply changing the reduce function, adding the precomputation and calling
n_mulmod_shoup
instead ofNMOD_RED
:There should be other places where this may be beneficial. However for more generalized use, it would probably help to add this
one_precomp
to the structnmod_t
, but I suspect there are good reasons for trying to avoid augmenting this struct?The text was updated successfully, but these errors were encountered: