Document bias and behavior when running out of entropy #220
`choose` and `choose_iter` incorrectly claimed to return `Error::NotEnoughData` when they in fact default to the first choice. This also documents that default in various other APIs.

Additionally, `int_in_range` (and APIs that rely on it) has bias for non-power-of-two ranges. `u.int_in_range(0..=170)`, for example, will consume one byte of entropy and take its value modulo 171 (the size of the range) to generate the returned integer. As a result, values in `0..=84` (the first ~half of the range) are twice as likely to be chosen as the rest (assuming the underlying bytes are uniform).
In general, the result distribution is only uniform if the range size is a power of two (where the modulo just masks some bits).
It would be accurate to document that return values are biased towards lower values when the range size is not a power of two, but do we want this much detail in the documented “contract” of this method?
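
The modulo bias described above can be checked exhaustively, since a single byte has only 256 possible values. The following is a minimal sketch that models only the "take one byte modulo the range size" step; `modulo_counts` is a hypothetical helper written for illustration, not part of the crate's API.

```rust
/// Counts how often each output value occurs when every possible byte
/// (0..=255) is reduced modulo `range_size`.
///
/// Hypothetical helper: it models only the modulo step described above,
/// not the real `int_in_range` implementation.
/// `range_size` must be between 1 and 256.
fn modulo_counts(range_size: usize) -> Vec<u32> {
    let mut counts = vec![0u32; range_size];
    for byte in 0..=255usize {
        counts[byte % range_size] += 1;
    }
    counts
}
```

For a range size of 171, values `0..=84` are each hit by two byte values and values `85..=170` by one. For a power-of-two size such as 64, every value is hit the same number of times.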
Similarly, I just called `ratio` “approximate”. `u.ratio(5, 7)` returns true for 184 out of 256 possible underlying byte values, ~0.6% too often. In the worst case, `u.ratio(84, 170)` returns true ~33% too often.

Notably, `#[derive(Arbitrary)]` chooses enum variants not with `choose_index` (although that seems most appropriate from reading the `Unstructured` docs) but by always consuming 4 bytes of entropy. `int_in_range` tries to minimize consumption based on the range size, but that contributes to having more bias than multiply + shift. Is this a real trade-off worth having two methods?
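
The `ratio` figures and the multiply + shift idea can both be sketched in a few lines. This is a rough model under stated assumptions, not the crate's code: `ratio_true_count` assumes `ratio` draws a value in `1..=denominator` from one byte via modulo and compares it to the numerator, and `multiply_shift` assumes the derive maps a 32-bit value onto `n` variants by widening, multiplying, and shifting. Both function names are hypothetical.

```rust
/// Number of byte values (out of 256) for which the modeled `ratio`
/// returns true.
///
/// Assumption: `ratio(numerator, denominator)` draws
/// `x = byte % denominator + 1` and returns `x <= numerator`.
/// `denominator` must be nonzero.
fn ratio_true_count(numerator: u8, denominator: u8) -> u32 {
    let num = u32::from(numerator);
    let den = u32::from(denominator);
    (0..=255u32).filter(|&byte| (byte % den) + 1 <= num).count() as u32
}

/// Maps a uniformly distributed 32-bit value `x` onto `0..n` using
/// multiply + shift: floor(x * n / 2^32).
///
/// Each output is produced by either floor(2^32 / n) or ceil(2^32 / n)
/// inputs, so the bias shrinks as more entropy bits are consumed.
fn multiply_shift(x: u32, n: u32) -> u32 {
    ((u64::from(x) * u64::from(n)) >> 32) as u32
}
```

Under this model, `ratio_true_count(5, 7)` gives 184 and `ratio_true_count(84, 170)` gives 168, against ideal counts of about 182.9 and 126.5. The multiply + shift variant always spends 4 bytes, which is the entropy cost being weighed against the lower bias.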