Support grouped aggregates with known min/max statistics #19938

@Dandandan

Description

Is your feature request related to a problem or challenge?

Currently, grouped aggregates follow this path (simplified):

  • create hashes for columns
  • group by hash using a hash table / check equality
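
For reference, the hash-based path can be sketched roughly as below. This is a simplified illustration using the standard library `HashMap` and a hypothetical `sum_by_group_hashed` helper, not DataFusion's actual (vectorized) implementation; it only shows that every row pays for hashing the key and probing the table.

```rust
use std::collections::HashMap;

/// Baseline sketch: grouped SUM keyed on the group value.
/// `sum_by_group_hashed` is a hypothetical name used for illustration only.
fn sum_by_group_hashed(keys: &[i64], values: &[i64]) -> HashMap<i64, i64> {
    assert_eq!(keys.len(), values.len());
    let mut groups: HashMap<i64, i64> = HashMap::new();
    for (&k, &v) in keys.iter().zip(values) {
        // Hash the key, probe the table, compare for equality, then update.
        *groups.entry(k).or_insert(0) += v;
    }
    groups
}
```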

This approach is well optimized, but we can avoid a lot of work if we don't have to hash the keys or use a hash table at all.

Describe the solution you'd like

When the column statistics, including the range (min/max), are known for a group by column, and the range is not too large, we can store the groups in a Vec where the element at index i represents the group with key min + i, using direct indexing.
This avoids both hashing and hash table probing, which could save a lot of overhead.
This is very similar to what's implemented in #19411 for joins.
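
A minimal sketch of the direct-indexing idea, assuming a single `i64` group column and a SUM aggregate. The names `sum_by_group_direct` and `MAX_RANGE` (and its value) are hypothetical and not part of DataFusion; the function returns `None` whenever the caller would need to fall back to the hash-based path.

```rust
/// Hypothetical cap on the number of slots we are willing to allocate.
const MAX_RANGE: i128 = 1 << 20;

/// Grouped SUM using direct indexing: slot `i` holds the group with key `min + i`.
/// Returns `None` if the range is too large or a key falls outside [min, max]
/// (for example because the statistics were inexact).
fn sum_by_group_direct(
    keys: &[i64],
    values: &[i64],
    min: i64,
    max: i64,
) -> Option<Vec<(i64, i64)>> {
    assert_eq!(keys.len(), values.len());
    if min > max {
        return None;
    }
    // Compute the width in i128 so that extreme min/max cannot overflow.
    let width = max as i128 - min as i128 + 1;
    if width > MAX_RANGE {
        return None;
    }
    // `None` marks a key in the range that never occurred in the input.
    let mut slots: Vec<Option<i64>> = vec![None; width as usize];
    for (&k, &v) in keys.iter().zip(values) {
        if k < min || k > max {
            return None;
        }
        // No hashing and no equality check: the key itself is the index.
        let i = (k - min) as usize;
        *slots[i].get_or_insert(0) += v;
    }
    // Emit only the groups that were actually seen, in key order.
    Some(
        slots
            .iter()
            .enumerate()
            .filter_map(|(i, s)| s.map(|sum| (min + i as i64, sum)))
            .collect(),
    )
}
```

For keys `[3, 5, 3, 7]` with statistics `min = 3`, `max = 7`, this allocates five slots and leaves the slots for keys 4 and 6 empty. A side effect of this layout is that groups come out ordered by key.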

Describe alternatives you've considered

We could also consider computing the statistics on the fly and switching dynamically from direct indexing to a hash table (i.e. copy all entries to a hash table once the range exceeds the maximum).
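
One way this dynamic alternative might look is sketched below, again for a single `i64` key and a SUM aggregate. The `Groups` type, its methods, and the `MAX_RANGE` threshold are all hypothetical names chosen for illustration; the sketch tracks the observed key range as rows arrive and copies every entry into a `HashMap` the first time the range exceeds the threshold.

```rust
use std::collections::HashMap;

/// Hypothetical threshold on the observed key range.
const MAX_RANGE: i128 = 1 << 16;

/// Group state that starts with direct indexing and spills to a hash map.
enum Groups {
    /// Slot `i` holds the running sum for key `base + i`.
    Direct { base: i64, slots: Vec<Option<i64>> },
    Hashed(HashMap<i64, i64>),
}

impl Groups {
    fn new() -> Self {
        Groups::Direct { base: 0, slots: Vec::new() }
    }

    fn add(&mut self, key: i64, value: i64) {
        let spilled = match self {
            Groups::Hashed(map) => {
                *map.entry(key).or_insert(0) += value;
                return;
            }
            Groups::Direct { base, slots } => {
                if slots.is_empty() {
                    // First key seen defines the initial range.
                    *base = key;
                    slots.push(None);
                }
                let lo = (*base).min(key);
                let hi = (*base + (slots.len() as i64 - 1)).max(key);
                let width = hi as i128 - lo as i128 + 1;
                if width <= MAX_RANGE {
                    if lo < *base {
                        // Grow at the front so that slot 0 maps to the new minimum.
                        let mut grown = vec![None; (*base - lo) as usize];
                        grown.append(slots);
                        *slots = grown;
                        *base = lo;
                    }
                    if slots.len() < width as usize {
                        slots.resize(width as usize, None);
                    }
                    *slots[(key - *base) as usize].get_or_insert(0) += value;
                    return;
                }
                // Range exceeded the maximum: copy all entries to a hash map.
                let mut map = HashMap::new();
                for (i, s) in slots.iter().enumerate() {
                    if let Some(sum) = s {
                        map.insert(*base + i as i64, *sum);
                    }
                }
                map
            }
        };
        let mut map = spilled;
        *map.entry(key).or_insert(0) += value;
        *self = Groups::Hashed(map);
    }

    fn get(&self, key: i64) -> Option<i64> {
        match self {
            Groups::Direct { base, slots } => {
                let i = key as i128 - *base as i128;
                if i < 0 || i >= slots.len() as i128 { None } else { slots[i as usize] }
            }
            Groups::Hashed(map) => map.get(&key).copied(),
        }
    }
}
```

The trade-off in this sketch is that the switch is one-way and the copy happens on the row that triggers it, so a single outlier key forces the hash path for the rest of the input.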

Additional context

No response

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), performance (Make DataFusion faster)


    Issue actions