This covers multiple related issues in checkpoint restore processing.
- Loading a checkpoint with an incorrect checksum carries the factors (and their count) over from the corrupted file. The bad checkpoint file is renamed and the job restarts from the beginning, but it still keeps the factors from the corrupted checkpoint, stores them in the new checkpoint, and adds them to the results.

  To reproduce, manually edit the checkpoint file and add some random "factors" and/or adjust the number of factors found, then resume the job. It starts from the beginning but includes all the extra factors from the edited checkpoint. If you stop it at this point, it stores them with a correct checksum and later includes them in the results when the job finishes.
- The number of found factors is stored in the checkpoint file and is not validated against the actual number of stored factors. The count does not need to be stored at all, since it can simply be recomputed, but for compatibility reasons it may remain as is; in that case it should be validated when the checkpoint is read.
- Duplicate factors are not excluded. Combined with the issues above, this can produce duplicates under otherwise normal conditions: for example, when an attempt to load a checkpoint from a previous version of mfakto fails, the assignment restarts from the beginning but carries over the factors already found in the previous run. When the job completes, the JSON output sorts the factors with the `cmp_int96()` function, which could also be used to detect duplicate factors during sorting. A sketch addressing all three points follows this list.
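Below is a minimal sketch of how checkpoint restore could be hardened against all three points, assuming factors are kept in a fixed array of `int96` values next to a `factorsfound` counter. The struct layout, field names, and the `reset_factor_state()`/`validate_restored_factors()` helpers are hypothetical, and the `cmp_int96()` stand-in only mimics the assumed `qsort()`-compatible signature of the real comparator:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins so the sketch compiles on its own; mfakto has its own
 * int96 type and cmp_int96() comparator. */
typedef struct { unsigned int d0, d1, d2; } int96;

static int cmp_int96(const void *a, const void *b)
{
    const int96 *x = a, *y = b;
    if (x->d2 != y->d2) return x->d2 > y->d2 ? 1 : -1;
    if (x->d1 != y->d1) return x->d1 > y->d1 ? 1 : -1;
    if (x->d0 != y->d0) return x->d0 > y->d0 ? 1 : -1;
    return 0;
}

#define MAX_FACTORS 20

typedef struct
{
    unsigned int factorsfound;   /* count as read from the checkpoint */
    unsigned int parsed_factors; /* factors actually parsed from the file */
    int96 factors[MAX_FACTORS];
} factor_state_t;                /* illustrative, not the real state struct */

/* On any validation failure (bad checksum, count mismatch), wipe the
 * factor state so nothing from the corrupted file survives the restart. */
static void reset_factor_state(factor_state_t *s)
{
    s->factorsfound = 0;
    s->parsed_factors = 0;
    memset(s->factors, 0, sizeof(s->factors));
}

/* Validate the stored count against the factors actually present, then
 * sort and drop adjacent duplicates with the existing comparator.
 * Returns 0 on success, -1 if the checkpoint should be rejected
 * (caller renames the bad file and restarts from the beginning). */
static int validate_restored_factors(factor_state_t *s)
{
    unsigned int i, kept;

    if (s->factorsfound != s->parsed_factors || s->factorsfound > MAX_FACTORS)
    {
        reset_factor_state(s);
        return -1;
    }
    if (s->factorsfound < 2)
        return 0;

    qsort(s->factors, s->factorsfound, sizeof(int96), cmp_int96);
    for (i = 1, kept = 1; i < s->factorsfound; i++)
    {
        if (cmp_int96(&s->factors[kept - 1], &s->factors[i]) != 0)
            s->factors[kept++] = s->factors[i]; /* keep first of each run */
    }
    s->factorsfound = kept;
    return 0;
}
```

Rejecting the checkpoint on a count mismatch keeps the stored count for compatibility while making it impossible for a tampered count to go unnoticed, and wiping the state on failure prevents stale factors from reappearing in the new checkpoint or the results.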
Mfaktc suffers from the same issues. I was testing some things related to #35 and ran into them with mfakto first, so I am opening the issue here, but these should be fixed in the mfaktc code first and then backported here.
There is another issue, not directly related to checkpoint restore but to the limit on found factors.
- The limit on the number of possible factors per assignment is reported as 20, but it is actually enforced at 10 (the small factors are fake; I injected them manually to test this behavior):
```
got assignment: exp=875998153 bit_min=67 bit_max=75 (8.70 GHz-days)
Starting trial factoring M875998153 from 2^67 to 2^75 (8.70 GHz-days)
k_min = 84231881580 - k_max = 21563362738595
Using GPU kernel "cl_barrett32_77_gs_2"
Found a valid checkpoint file.
last finished class was: 36
found 8 factors so far: 3 5 7 11 17 17014579560952088052367 328795052957258655073 1475794299830310123119
previous work took 92094 ms
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Sep 27 23:02 | 1551 33.5% | 0.097 1m02s | 8073.24 81206 0.00%
M875998153 has a factor: 17014579560952088052367 (73.849190 bits)
Sep 27 23:02 | 2052 44.5% | 0.098 0m52s | 6785.75 81206 0.00%
M875998153 has a factor: 328795052957258655073 (68.155750 bits)
ERROR: reached limit of 20 factors for this job, try a different range
Starting trial factoring M875998153 from 2^67 to 2^75 (8.70 GHz-days)
k_min = 84231881580 - k_max = 21563362738595
Using GPU kernel "cl_barrett32_77_gs_2"
Found a valid checkpoint file.
last finished class was: 36
found 8 factors so far: 3 5 7 11 17 17014579560952088052367 328795052957258655073 1475794299830310123119
previous work took 92094 ms
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Sep 27 23:03 | 1551 33.5% | 0.098 1m03s | 7990.86 81206 0.00%
M875998153 has a factor: 17014579560952088052367 (73.849190 bits)
Sep 27 23:03 | 2052 44.5% | 0.099 0m53s | 6717.21 81206 0.00%
M875998153 has a factor: 328795052957258655073 (68.155750 bits)
ERROR: reached limit of 20 factors for this job, try a different range
Starting trial factoring M875998153 from 2^67 to 2^75 (8.70 GHz-days)
k_min = 84231881580 - k_max = 21563362738595
Using GPU kernel "cl_barrett32_77_gs_2"
```
More importantly, once it reaches the limit, it restarts from the last checkpoint, resulting in an endless loop that factors the same range over and over. The proper way to handle this would be like the "forced StopAfterFactor=2" case: issue a warning, record the incomplete bit-range result, remove the assignment from worktodo.txt, and proceed with the next one. Alternatively, migrate to dynamic memory allocation for the factors, calling realloc(factors, (factorsfound + 1) * sizeof(*factors)) each time a new factor is found.
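A minimal sketch of the dynamic-allocation alternative, again assuming `int96` factors and a `factorsfound` counter; `add_factor()` is a hypothetical helper, not mfakto's actual API:

```c
#include <stdlib.h>

typedef struct { unsigned int d0, d1, d2; } int96; /* stand-in for mfakto's type */

/* Hypothetical helper: grow the factor array by one element for each
 * new factor instead of relying on a fixed compile-time limit.
 * Returns 0 on success, -1 on allocation failure. */
static int add_factor(int96 **factors, unsigned int *factorsfound, int96 new_factor)
{
    int96 *tmp = realloc(*factors, (*factorsfound + 1) * sizeof(int96));

    if (tmp == NULL)
        return -1; /* out of memory: the old array is still valid */

    tmp[*factorsfound] = new_factor;
    *factors = tmp;
    (*factorsfound)++;
    return 0;
}
```

Growing one element at a time costs O(n^2) copying in the worst case, but with at most a few dozen factors per assignment that is negligible, and it removes the hard limit entirely.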
Mfaktc does not suffer from the mismatch between the reported limit of 20 and an actual limit of 10; it can handle up to 19 factors. But when it hits the 20th, it likewise restarts the same job from the last checkpoint (incorrectly reporting "ERROR: Too many factors found for this job, (>20)..." when it finds exactly the 20th factor).
Most of the issues covered here do not arise under normal operation (without tampering with the checkpoint file), so the severity is not high. But they are possible under some conditions, for example, when migrating the checkpoint from a previous version of the software.