Open
Labels
enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge?
Currently, grouped aggregates follow this path (simplified):
- compute hashes for the group by columns
- look up each row's group in a hash table using the hash, checking equality on the key values
This approach is well optimized, but we could avoid a lot of work if we did not have to hash and probe a hash table at all.
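To make the per-row cost concrete, here is a minimal sketch of the hash-based path for a single `i64` group by column. It is not DataFusion's actual implementation; `group_ids_hashed` is a hypothetical name, and `std::collections::HashMap` stands in for the real hash table.

```rust
use std::collections::HashMap;

/// Assign a dense group id to every row of a single i64 group by column.
/// Returns (group id per row, distinct group values in first-seen order).
/// Hypothetical helper for illustration only.
fn group_ids_hashed(keys: &[i64]) -> (Vec<usize>, Vec<i64>) {
    let mut table: HashMap<i64, usize> = HashMap::new();
    let mut group_values: Vec<i64> = Vec::new();
    let mut ids = Vec::with_capacity(keys.len());
    for &k in keys {
        // Every row pays for hashing the key plus a probe with an equality check.
        let next = group_values.len();
        let id = *table.entry(k).or_insert_with(|| {
            group_values.push(k);
            next
        });
        ids.push(id);
    }
    (ids, group_values)
}
```

The hashing and probing inside the loop is the work the proposal aims to skip.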
Describe the solution you'd like
When the column statistics, including the range (min/max), are known for a group by column, and the range is not too large, we can store the groups in a Vec where the element at index i represents the group with value min + i, using direct indexing.
This replaces hashing, probing, and equality checks with a single subtraction and array access per row, which could save a lot of overhead.
This is very similar to what's implemented in #19411 for joins.
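A minimal sketch of the direct-indexing idea for a single `i64` column, assuming the min/max statistics are exact and ignoring nulls. The names `group_ids_direct` and `MAX_RANGE` are hypothetical, and the cap value is an arbitrary placeholder, not a proposed default.

```rust
/// Hypothetical cap on the key range we are willing to cover with a Vec.
const MAX_RANGE: u64 = 1 << 20;

/// Direct-indexed group ids for an i64 column whose min/max are known.
/// Returns None when direct indexing does not apply, so the caller can
/// fall back to the hash-based path.
fn group_ids_direct(keys: &[i64], min: i64, max: i64) -> Option<(Vec<usize>, Vec<i64>)> {
    if max < min {
        return None;
    }
    // wrapping_sub + cast yields the true distance even if max - min overflows i64.
    let range = max.wrapping_sub(min) as u64;
    if range >= MAX_RANGE {
        return None;
    }
    // slots[i] holds the group id for value min + i, if that value was seen.
    let mut slots: Vec<Option<usize>> = vec![None; range as usize + 1];
    let mut group_values: Vec<i64> = Vec::new();
    let mut ids = Vec::with_capacity(keys.len());
    for &k in keys {
        // Assumption: statistics are exact. Bail out if a key violates them.
        if k < min || k > max {
            return None;
        }
        let slot = k.wrapping_sub(min) as u64 as usize;
        // No hashing and no equality check: one subtraction and one index.
        let id = *slots[slot].get_or_insert_with(|| {
            group_values.push(k);
            group_values.len() - 1
        });
        ids.push(id);
    }
    Some((ids, group_values))
}
```

For keys `[10, 12, 10, 11, 12]` with min 10 and max 12, this allocates three slots and assigns group ids in first-seen order, the same ids a hash table would produce.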
Describe alternatives you've considered
We could also consider computing the statistics on the fly and switching dynamically from direct indexing to a hash table (i.e. copy all entries to a hash table once the range exceeds the maximum).
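One way this alternative might look, as a rough sketch only: track the observed range while inserting, grow the Vec as the range widens, and migrate every entry to a hash table once the range passes a cap. `AdaptiveGrouper`, `Groups`, and `MAX_SLOTS` are hypothetical names, and the cap is an arbitrary placeholder.

```rust
use std::collections::HashMap;

/// Hypothetical cap on the number of direct-indexed slots.
const MAX_SLOTS: usize = 1 << 16;

enum Groups {
    /// slots[i] is the group id for value min + i, if seen.
    Direct { min: i64, slots: Vec<Option<usize>> },
    Hashed(HashMap<i64, usize>),
}

struct AdaptiveGrouper {
    groups: Groups,
    next_id: usize,
}

impl AdaptiveGrouper {
    fn new() -> Self {
        Self { groups: Groups::Direct { min: 0, slots: Vec::new() }, next_id: 0 }
    }

    fn is_hashed(&self) -> bool {
        matches!(self.groups, Groups::Hashed(_))
    }

    /// Return the group id for `key`, creating a new group if needed.
    fn group_id(&mut self, key: i64) -> usize {
        if let Groups::Direct { min, slots } = &mut self.groups {
            // Range observed so far, extended to include `key`.
            let (lo, hi) = if slots.is_empty() {
                (key, key)
            } else {
                (key.min(*min), key.max(*min + (slots.len() as i64 - 1)))
            };
            let span = hi as i128 - lo as i128 + 1;
            if span <= MAX_SLOTS as i128 {
                if !slots.is_empty() && lo < *min {
                    // New minimum: shift existing entries right.
                    let mut grown = vec![None; (*min - lo) as usize];
                    grown.append(slots);
                    *slots = grown;
                }
                *min = lo;
                if slots.len() < span as usize {
                    slots.resize(span as usize, None);
                }
                let next_id = &mut self.next_id;
                return *slots[(key - lo) as usize].get_or_insert_with(|| {
                    *next_id += 1;
                    *next_id - 1
                });
            }
            // Range exceeded the cap: copy all entries to a hash table.
            let base = *min;
            let table: HashMap<i64, usize> = slots
                .iter()
                .enumerate()
                .filter_map(|(i, id)| id.map(|g| (base + i as i64, g)))
                .collect();
            self.groups = Groups::Hashed(table);
        }
        match &mut self.groups {
            Groups::Hashed(table) => {
                let next_id = &mut self.next_id;
                *table.entry(key).or_insert_with(|| {
                    *next_id += 1;
                    *next_id - 1
                })
            }
            Groups::Direct { .. } => unreachable!("migrated to Hashed above"),
        }
    }
}
```

Group ids stay stable across the migration, so accumulator state indexed by group id would not need to be rebuilt. The trade-off is the cost of regrowing the Vec whenever the observed minimum or maximum moves.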
Additional context
No response