More options for (weighted) reservoir sampling

## Summary

Recently, while doing research for a blog post, I implemented two different schemes for weighted reservoir sampling: the [A-Res algorithm by Efraimidis and Spirakis](https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_A-Res), and [VarOptK by Cohen et al.](http://www.cohenwang.com/edith/Papers/varopt-full.pdf).

I'd like to see both of these upstreamed here, as they use different interpretations of weights: A-Res is equivalent to (though much more efficient than) collecting the data elements and their weights into a discrete distribution, then sampling $n$ elements from that distribution, removing the selected element from it each time. VarOptK doesn't have an obvious sequential analog, but it has the very nice property that the probability of an element being included in the sample will be proportional to that element's relative weight (`weight_i / total_weights`), if possible.

At the very least, I'd like to add two new methods to `IteratorRandom`: `choose_multiple_weighted` and `choose_multiple_weighted_proportionally`. The names are obviously subject to bikeshedding, I'm not entirely happy with them either.

I also propose exposing an advanced API that these methods would simply wrap. While I personally don't have a use case for this, a major reason to want a reservoir sampling algorithm is that they can take in the sampled population one element at a time, and at any point extract a sample of the part of the population seen so far. Essentially, I want to decouple the implementation of these algorithms from the `Iterator` trait, and allow users to do this, which would add indirect support for sampling infinite iterators or async streams, for example. I also suggest giving the existing `IteratorRandom::choose_multiple` the same treatment.

## Details

I'd add a module `rand::reservoir` containing three types corresponding to these three algorithms, each with essentially the same API:

```rust
struct Reservoir<T> { /* ... */ }
impl<T> Reservoir<T> {
    fn with_sample_size(n: usize) -> Self { /* ... */ }
    fn feed<R: Rng + ?Sized>(&mut self, rng: &mut R, item: T, weight: f64) -> ??? {
        // Should this simply return `bool` to indicate whether this item was selected?
        // Or a more complex type, like `Result<Option<T>, T>`, to return `Err(item)` if it wasn't selected,
        // and `Ok` with the potentially replaced item from the reservoir if it was?
        // Do we need a return value at all?
    }
    fn current_sample(&self) -> impl Iterator<Item = &T> { /* ... */ }

    // Maybe also something like this:
    fn feed_to_next_replacement<I, F>(&mut self, rng: &mut R, items: &mut I, weights: F) -> Option<T>
    where
        R: Rng + ?Sized,
        I: Iterator<Item = T>,
        F: FnMut(&T) -> f64,
    { /* return `None` if `items` is exhausted before the sample is updated, `Some(replaced_item)` otherwise */ }

    // Other convenience functions?
}

impl<T> IntoIterator for &Reservoir<T> { /* equivalent to `current_sample()` */ }
impl<T> IntoIterator for Reservoir<T> { /* consume the Reservoir to get an iterator of owned items */ }
impl<T> Into<Vec<T>> for Reservoir<T> { /* equivalent to, but potentially more efficient than, `into_iter().collect()` */ }

// Other traits to implement?
```

Again, I'm not sure what to name these types. The best descriptive names I can come up with would be along the lines of `Uniform`, `Weighted`, and `ProportionallyWeighted`, but there's the unfortunate name clashing. Naming the structs after the corresponding algorithms or their inventors would be an option, but that's very opaque, I feel. Suggestions would be appreciated.

Anyway, from this, we'd get:

```rust
pub trait IteratorRandom: Iterator + Sized {
    /* other methods omitted */
    
    fn choose_multiple<R>(mut self, rng: &mut R, amount: usize) -> Vec<Self::Item>
    where
        R: Rng + ?Sized,
    {
        let mut reservoir = crate::reservoir::Uniform::with_sample_size(amount);
        for item in self {
            reservoir.feed(rng, item);
        }
        reservoir.into()
    }
    
    // and analogously:
    fn choose_multiple_weighted<R, F>(mut self, rng: &mut R, amount: usize, weight: F) -> Vec<Self::Item> where ...
    fn choose_multiple_weighted_proportionally<R, F>(/* ... */) -> Vec<Self::item> where ...
}
```

## Alternatives and Concluding Notes

I guess there's a case to be made for making a new trait `ReservoirSampler` (or similar) that all three structs would implement, and making the `choose_multiple` method generic over the reservoir type, but that seems over-engineered to me.

I think it's also worth discussing whether to also extend the `SliceRandom` trait. If I'm reading the implementation of `SliceRandom::choose_multiple_weighted` right, it's equivalent to `slice.iter().choose_multiple_weighted()`, so maybe it should also get an equivalent to `slice.iter().choose_multiple_weighted_proportionally()`? That doesn't seem necessary to me either, but users may expect that kind of symmetry.

I could probably bang out a pull request for all this relatively quickly - I already have working, if rudimentary implementations, after all - , I just wanted to gauge interest first, and get some opinions on these surface level questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

More options for (weighted) reservoir sampling #1365

Summary

Details

Alternatives and Concluding Notes

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

More options for (weighted) reservoir sampling #1365

Description

Summary

Details

Alternatives and Concluding Notes

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions