Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuple.collectors() should be able to share state #344

Open
lukaseder opened this issue Feb 12, 2019 · 8 comments
Open

Tuple.collectors() should be able to share state #344

lukaseder opened this issue Feb 12, 2019 · 8 comments

Comments

@lukaseder
Copy link
Member

The current Tuple.collectors() implementation is a canonical one, where collectors are completely independent of one another. It would be better if they could share state in case they're related. For example, when calculating a set of percentiles:

var percentiles =
Stream.of(1, 2, 3, 4, 10, 9, 3, 3).collect(
  Tuple.collectors(
    Agg.<Integer>percentile(0.0),
    Agg.<Integer>percentile(0.25),
    Agg.<Integer>percentile(0.5),
    Agg.<Integer>percentile(0.75),
    Agg.<Integer>percentile(1.0)
  )
);
 
System.out.println(percentiles);

It would be great if the 5 collectors could somehow share their internal state, which is a sorted list of all values in the stream. Executing 1 sort instead of 5 is definitely preferrable.

This can be implemented in 2 ways:

  • Optimising for specific collector usage
  • Optimising this in general
@rwperrott

This comment has been minimized.

@rwperrott

This comment has been minimized.

@lukaseder

This comment has been minimized.

@rwperrott
Copy link

rwperrott commented Jan 13, 2021

BiFunction<List<Tuple2<T, U>>, Function<? super T, ? extends U>, ? extends R1> sf1

could be declared as a sub-class of BiFunction, like:

@FunctionalInterface
public interface Tuple2ListMapping<T,U,R> 
extends BiFunction<List<Tuple2<T, U>>, Function<? super T, ? extends U>, ? extends R> {}

allowing much shorter column declarations, like

Tuple2ListMapping<T,U,R> sf1

I don't think that column source sharing needs to be complex, because id binding is simpler and is probably safe than introspection guesses:

  • Specify each result column with a String id.
  • The first result column with a unique id, causes a base collector to be created, wrapped in a collector which stores the base collector finisher result, andThen returns the first column result, with this wrapper collector mapped by id inside a context object, possibly Seq. This could allow a separate map for Window, assuming it has a subset Seq, so prevent out-of-context id use.
  • All the later result columns with a used id would have secondary collectors created by the primary collector, using the primary saved base collector finisher result as input for it's finisher function.
  • If other than a list type collector needs to be supported (e.g. for non-percentile statistics), base collector result type validation would be needed to be before secondary collector creation, with all id columns specifying the required type.

Something like this may work (not tested yet):

public class IdCollector<T,A,R,RR> implements Collector<T,A,RR> {
    final Collector<T,A,R> baseCollector;
    final Function<R,RR> primaryFinisher;
    private R r; // baseCollector.finisher().apply(a) result.

    public IdCollector(final Collector<T, A, R> baseCollector, final Function<R, RR> primaryFinisher) {
        this.baseCollector= baseCollector;
        this.primaryFinisher= primaryFinisher;
    }
 
    // Assumes that created secondary collector finisher() will always be called after primary collector finisher()
    public Collector<T,A,RR> secondary(final Function<R,RR> secondaryFinisher) {
        return Collector.of(
                ()->null,
                (a,t) -> {},
                (a1,a2) -> null,
                a -> secondaryFinisher.apply(r));
    }

    @Override
    public Supplier<A> supplier() {
        return baseCollector.supplier();
    }

    @Override
    public BiConsumer<A, T> accumulator() {
        return baseCollector.accumulator();
    }

    @Override
    public BinaryOperator<A> combiner() {
        return baseCollector.combiner();
    }

    @Override
    public Function<A, RR> finisher() {
        return baseCollector.finisher()
                        .andThen(r -> primaryFinisher.apply(this.r = r));
    }

    @Override
    public Set<Characteristics> characteristics() {
        return baseCollector.characteristics();
    }
}

@rwperrott
Copy link

rwperrott commented Jan 24, 2021

For *By columns this would be most efficient if the T to U keyExtractor function and keyComparator are both reference equal for all columns, to allow sharing a collected, the pre-sorted List, sorted by the same U values, otherwise the list would need to be copied and resorted for each column.

It looks more memory efficient to base the *By functionality off a list of the T value, then using keyComparator with the keyExtractor e.g via this method in Comparator (Java 1.8+)

    public static <T, U> Comparator<T> comparing(
            Function<? super T, ? extends U> keyExtractor,
            Comparator<? super U> keyComparator)

Likewise if the List collector based sourced columns Comparator must be reference equal for all columns, otherwise the list would need to be copied and resorted for each column.

@rwperrott
Copy link

rwperrott commented Jan 28, 2021

See my latest Sharable Collector code, with support for reference and mapped id based results sharing at https://github.com/rwperrott/sharable-collector

@lukaseder
Copy link
Member Author

@rwperrott I appreciate your efforts, thank you very much. To make sure expectations are set right, I currently cannot invest much time in jOOL. This includes reviewing your various messages and code (as this is by no means a trivial task).

At some point in the future, I'll be able to allocate some dedicate time to see what's currently open and missing, but I cannot make any promises.

@rwperrott
Copy link

@lukaseder Thanks. I expected you were busy. given prior feedback. I also developed this speculative solution because it's a fun exercise and I think it'll be useful for my projects too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants