Skip to content

Conversation

@fitzgen
Copy link
Member

@fitzgen fitzgen commented Apr 18, 2025

choose and choose_iter incorrectly claimed to return Error::NotEnoughData when they in fact default to the first choice. This also documents that default in various other APIs.

Additionally, int_in_range (and APIs that rely on it) has bias for non-power-of-two ranges.
u.int_in_range(0..=170) for example will consume one byte of entropy, and take its value modulo 171 (the size of the range) to generate the returned integer.
As a result, values in 0..=84 (the first ~half of the range) are twice as likely to get chosen as the rest
(assuming the underlying bytes are uniform).
In general, the result distribution is only uniform if the range size is a power of two (where the modulo just masks some bits).

It would be accurate to document that return values are biased towards lower values when the range size is not a power of two, but do we want this much detail in the documented “contract” of this method?

Similarly, I just called ratio “approximate”. u.ratio(5, 7) returns true for 184 out of 256 possible underlying byte values, ~0.6% too often. In the worst case, u.ratio(84, 170) return true ~33% too often.

Notably, #[derive(Arbitrary)] chooses enum variants not with choose_index (although that seems most appropriate from reading Unstructured docs) but by always consuming 4 bytes of entropy:

// Use a multiply + shift to generate a ranged random number
// with slight bias. For details, see:
// https://lemire.me/blog/2016/06/30/fast-random-shuffling
Ok(match (u64::from(<u32 as arbitrary::Arbitrary>::arbitrary(u)?) * #count) >> 32 {
    #(#variants,)*
    _ => unreachable!()
})

int_in_range tries to minimize consumption based on the range size but that contributes to having more bias than multiply + shift. Is this a real trade-off worth having two methods?

@fitzgen
Copy link
Member Author

fitzgen commented Apr 18, 2025

Rebased version of #184 to fix conflicts.

`choose` and `choose_iter` incorrectly claimed to return `Error::NotEnoughData`
when they in fact default to the first choice. This also documents
that default in various other APIs.

Additionally, `int_in_range` (and APIs that rely on it) has bias
for non-power-of-two ranges.
`u.int_in_range(0..=170)` for example will consume one byte of entropy,
and take its value modulo 171 (the size of the range)
to generate the returned integer.
As a result, values in `0..=84` (the first ~half of the range)
are twice as likely to get chosen as the rest
(assuming the underlying bytes are uniform).
In general, the result distribution is only uniform if the range size
is a power of two (where the modulo just masks some bits).

It would be accurate to document that return values
are biased towards lower values when the range size is not a power of two,
but do we want this much detail in the documented “contract” of this method?

Similarly, I just called `ratio` “approximate”. `u.ratio(5, 7)` returns true
for 184 out of 256 possible underlying byte values, ~0.6% too often.
In the worst case, `u.ratio(84, 170)` return true ~33% too often.

Notably, `#[derive(Arbitrary)]` chooses enum variants not with `choose_index`
(although that seems most appropriate from reading `Unstructured` docs)
but by always consuming 4 bytes of entropy:

```rust
// Use a multiply + shift to generate a ranged random number
// with slight bias. For details, see:
// https://lemire.me/blog/2016/06/30/fast-random-shuffling
Ok(match (u64::from(<u32 as arbitrary::Arbitrary>::arbitrary(u)?) * #count) >> 32 {
    #(#variants,)*
    _ => unreachable!()
})
```

`int_in_range` tries to minimize consumption based on the range size
but that contributes to having more bias than multiply + shift.
Is this a real trade-off worth having two methods?
@fitzgen fitzgen merged commit 3343511 into rust-fuzz:main Apr 18, 2025
6 checks passed
@fitzgen fitzgen deleted the docs branch April 18, 2025 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants