-
-
Notifications
You must be signed in to change notification settings - Fork 480
Description
Summary
Recently, while doing research for a blog post, I implemented two different schemes for weighted reservoir sampling: the A-Res algorithm by Efraimidis and Spirakis, and VarOptK by Cohen et al..
I'd like to see both of these upstreamed here, as they use different interpretations of weights: A-Res is equivalent to (though much more efficient than) collecting the data elements and their weights into a discrete distribution, then sampling weight_i / total_weights), if possible.
At the very least, I'd like to add two new methods to IteratorRandom: choose_multiple_weighted and choose_multiple_weighted_proportionally. The names are obviously subject to bikeshedding, I'm not entirely happy with them either.
I also propose exposing an advanced API that these methods would simply wrap. While I personally don't have a use case for this, a major reason to want a reservoir sampling algorithm is that they can take in the sampled population one element at a time, and at any point extract a sample of the part of the population seen so far. Essentially, I want to decouple the implementation of these algorithms from the Iterator trait, and allow users to do this, which would add indirect support for sampling infinite iterators or async streams, for example. I also suggest giving the existing IteratorRandom::choose_multiple the same treatment.
Details
I'd add a module rand::reservoir containing three types corresponding to these three algorithms, each with essentially the same API:
struct Reservoir<T> { /* ... */ }
impl<T> Reservoir<T> {
fn with_sample_size(n: usize) -> Self { /* ... */ }
fn feed<R: Rng + ?Sized>(&mut self, rng: &mut R, item: T, weight: f64) -> ??? {
// Should this simply return `bool` to indicate whether this item was selected?
// Or a more complex type, like `Result<Option<T>, T>`, to return `Err(item)` if it wasn't selected,
// and `Ok` with the potentially replaced item from the reservoir if it was?
// Do we need a return value at all?
}
fn current_sample(&self) -> impl Iterator<Item = &T> { /* ... */ }
// Maybe also something like this:
fn feed_to_next_replacement<I, F>(&mut self, rng: &mut R, items: &mut I, weights: F) -> Option<T>
where
R: Rng + ?Sized,
I: Iterator<Item = T>,
F: FnMut(&T) -> f64,
{ /* return `None` if `items` is exhausted before the sample is updated, `Some(replaced_item)` otherwise */ }
// Other convenience functions?
}
impl<T> IntoIterator for &Reservoir<T> { /* equivalent to `current_sample()` */ }
impl<T> IntoIterator for Reservoir<T> { /* consume the Reservoir to get an iterator of owned items */ }
impl<T> Into<Vec<T>> for Reservoir<T> { /* equivalent to, but potentially more efficient than, `into_iter().collect()` */ }
// Other traits to implement?Again, I'm not sure what to name these types. The best descriptive names I can come up with would be along the lines of Uniform, Weighted, and ProportionallyWeighted, but there's the unfortunate name clashing. Naming the structs after the corresponding algorithms or their inventors would be an option, but that's very opaque, I feel. Suggestions would be appreciated.
Anyway, from this, we'd get:
pub trait IteratorRandom: Iterator + Sized {
/* other methods omitted */
fn choose_multiple<R>(mut self, rng: &mut R, amount: usize) -> Vec<Self::Item>
where
R: Rng + ?Sized,
{
let mut reservoir = crate::reservoir::Uniform::with_sample_size(amount);
for item in self {
reservoir.feed(rng, item);
}
reservoir.into()
}
// and analogously:
fn choose_multiple_weighted<R, F>(mut self, rng: &mut R, amount: usize, weight: F) -> Vec<Self::Item> where ...
fn choose_multiple_weighted_proportionally<R, F>(/* ... */) -> Vec<Self::item> where ...
}Alternatives and Concluding Notes
I guess there's a case to be made for making a new trait ReservoirSampler (or similar) that all three structs would implement, and making the choose_multiple method generic over the reservoir type, but that seems over-engineered to me.
I think it's also worth discussing whether to also extend the SliceRandom trait. If I'm reading the implementation of SliceRandom::choose_multiple_weighted right, it's equivalent to slice.iter().choose_multiple_weighted(), so maybe it should also get an equivalent to slice.iter().choose_multiple_weighted_proportionally()? That doesn't seem necessary to me either, but users may expect that kind of symmetry.
I could probably bang out a pull request for all this relatively quickly - I already have working, if rudimentary implementations, after all - , I just wanted to gauge interest first, and get some opinions on these surface level questions.