-
Notifications
You must be signed in to change notification settings - Fork 275
Description
Consider n < 256. We get 8x memory reduction using nmod8 instead of nmod, so it would be interesting to optimize the arithmetic in the nmod8 representation. Some low-hanging fruit here is to do batch modular reduction using SIMD operations.
For really small n we can do even better by bit packing. Obviously for n = 2 we want to use a single word to represent a chunk of 64 entries. Polynomial multiplication is easy on modern hardware since there are now dedicated CLMUL instructions
What about other tiny n?
We can store an entry of Z/3Z using 2 bits, so it's possible to pack 32 entries into a word. Packing as tightly as possible is not obviously optimal: using 3 or 4 bits or even 6 or 8 bits has the benefit that one can do some arithmetic in-place before reducing. When packing to 2 bits, the article https://doi.org/10.36045/bbms/1369316548 discusses a nonstandard encoding which makes the arithmetic operations easier to implement using bitwise operations. To figure out what works best, one probably needs some implementation experiments.
It should be possible to do some more small cases like n = 5 and 7 very similarly.