You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently, Trino performs partition pruning based on partition column values, and separately performs file-level pruning based on min/max
statistics.
However, there's an opportunity to enhance partition pruning by using min/max statistics from non-partition columns to eliminate
entire partitions when possible.
This is particularly relevant for Hive-compatible tables where high-cardinality columns (like user_id or product_id) are typically not used as
partition columns due to the limitations of the Hive partitioning model, which creates a separate directory for each partition value. File-level pruning need split creation which is an overhead when all files are being skipped from entire partition.
Proposed Enhancement
Enhance the partition pruning mechanism to leverage min/max statistics collected from non-partition columns to skip entire partitions when the
query predicate cannot be satisfied by any file in that partition.
Example Scenario
Consider a table partitioned by date with the following structure:
• Partition column: date (low cardinality, suitable for Hive partitioning)
• Non-partition columns: user_id (high cardinality, not suitable for Hive partitioning), product_id, amount
Current partitions:
• date=2023-01-01 (contains user_ids from 1-1000)
• date=2023-01-02 (contains user_ids from 1001-2000)
• date=2023-01-03 (contains user_ids from 2001-3000)
When executing a query like:
sql
SELECT * FROM orders
WHERE date BETWEEN '2023-01-01' AND '2023-01-03'
AND user_id = 500;
Current behavior:
• Trino scans all three partitions because they match the date predicate
• Then applies file-level pruning within each partition based on user_id statistics
Proposed behavior:
• Trino would recognize that only the partition date=2023-01-01 contains files with user_id=500 (based on min/max statistics)
• Partitions date=2023-01-02 and date=2023-01-03 would be pruned entirely without reading any files
Benefits
Reduced I/O operations by eliminating unnecessary partition scans
Improved query performance, especially for large datasets with many partitions
Better resource utilization across the cluster
Addresses a fundamental limitation in Hive's partitioning model, which cannot efficiently handle high-cardinality columns
Implementation Considerations
Need to aggregate min/max statistics at the partition level for non-partition columns
Consider storage and maintenance overhead for these additional statistics
Ensure statistics are updated appropriately during data modifications
Maintain compatibility with existing Hive metadata and partitioning schemes
Relation to Hive Partitioning Limitations
In Hive-compatible storage, partitioning is typically limited to low-cardinality columns because:
Each partition value creates a separate directory in the filesystem
Too many partitions (from high-cardinality columns) can overwhelm the filesystem and metadata store
Small files problem occurs when high-cardinality columns create many small partitions
This enhancement would provide some of the performance benefits of partitioning on high-cardinality columns without the associated storage and
management drawbacks. It effectively creates a "virtual partitioning" scheme based on statistics rather than physical directory structures.
This would be particularly valuable for tables with many partitions where each partition contains data with distinct ranges of values in non-
partition columns, allowing users to get partition-level pruning benefits for columns that are impractical to use as actual partition keys.
The text was updated successfully, but these errors were encountered:
Thanks @raunaqmorarka! Got answer on Trino #begineer channel.
Why this feature cannot be not supported for Hive catalog?
Hive table statistics are not guaranteed to be in sync with the actual data, so we don't use them for predicate pruning to avoid correctness issues.
You will get the same pruning from min/max indexes at parquet/orc reader layer anyway.
If you use iceberg or delta lake instead, those will do additional predicate pruning of files based on min/max stats on non-partitioned columns in the coordinator.
Issue Description:
Problem Statement
Currently, Trino performs partition pruning based on partition column values, and separately performs file-level pruning based on min/max
statistics.
However, there's an opportunity to enhance partition pruning by using min/max statistics from non-partition columns to eliminate
entire partitions when possible.
This is particularly relevant for Hive-compatible tables where high-cardinality columns (like user_id or product_id) are typically not used as
partition columns due to the limitations of the Hive partitioning model, which creates a separate directory for each partition value. File-level pruning need split creation which is an overhead when all files are being skipped from entire partition.
Proposed Enhancement
Enhance the partition pruning mechanism to leverage min/max statistics collected from non-partition columns to skip entire partitions when the
query predicate cannot be satisfied by any file in that partition.
Example Scenario
Consider a table partitioned by date with the following structure:
• Partition column: date (low cardinality, suitable for Hive partitioning)
• Non-partition columns: user_id (high cardinality, not suitable for Hive partitioning), product_id, amount
Current partitions:
• date=2023-01-01 (contains user_ids from 1-1000)
• date=2023-01-02 (contains user_ids from 1001-2000)
• date=2023-01-03 (contains user_ids from 2001-3000)
When executing a query like:
sql
SELECT * FROM orders
WHERE date BETWEEN '2023-01-01' AND '2023-01-03'
AND user_id = 500;
Current behavior:
• Trino scans all three partitions because they match the date predicate
• Then applies file-level pruning within each partition based on user_id statistics
Proposed behavior:
• Trino would recognize that only the partition date=2023-01-01 contains files with user_id=500 (based on min/max statistics)
• Partitions date=2023-01-02 and date=2023-01-03 would be pruned entirely without reading any files
Benefits
Implementation Considerations
Relation to Hive Partitioning Limitations
In Hive-compatible storage, partitioning is typically limited to low-cardinality columns because:
This enhancement would provide some of the performance benefits of partitioning on high-cardinality columns without the associated storage and
management drawbacks. It effectively creates a "virtual partitioning" scheme based on statistics rather than physical directory structures.
This would be particularly valuable for tables with many partitions where each partition contains data with distinct ranges of values in non-
partition columns, allowing users to get partition-level pruning benefits for columns that are impractical to use as actual partition keys.
The text was updated successfully, but these errors were encountered: