Enhance partition pruning by leveraging min/max statistics from non-partition columns #25896

PawanMahalleTech · 2025-06-01T16:08:34Z

Issue Description:

Problem Statement

Currently, Trino performs partition pruning based on partition column values, and separately performs file-level pruning based on min/max
statistics.

However, there's an opportunity to enhance partition pruning by using min/max statistics from non-partition columns to eliminate
entire partitions when possible.

This is particularly relevant for Hive-compatible tables where high-cardinality columns (like user_id or product_id) are typically not used as
partition columns due to the limitations of the Hive partitioning model, which creates a separate directory for each partition value. File-level pruning need split creation which is an overhead when all files are being skipped from entire partition.

Proposed Enhancement

Enhance the partition pruning mechanism to leverage min/max statistics collected from non-partition columns to skip entire partitions when the
query predicate cannot be satisfied by any file in that partition.

Example Scenario

Consider a table partitioned by date with the following structure:
• Partition column: date (low cardinality, suitable for Hive partitioning)
• Non-partition columns: user_id (high cardinality, not suitable for Hive partitioning), product_id, amount

Current partitions:
• date=2023-01-01 (contains user_ids from 1-1000)
• date=2023-01-02 (contains user_ids from 1001-2000)
• date=2023-01-03 (contains user_ids from 2001-3000)

When executing a query like:
sql
SELECT * FROM orders
WHERE date BETWEEN '2023-01-01' AND '2023-01-03'
AND user_id = 500;

Current behavior:
• Trino scans all three partitions because they match the date predicate
• Then applies file-level pruning within each partition based on user_id statistics

Proposed behavior:
• Trino would recognize that only the partition date=2023-01-01 contains files with user_id=500 (based on min/max statistics)
• Partitions date=2023-01-02 and date=2023-01-03 would be pruned entirely without reading any files

Benefits

Reduced I/O operations by eliminating unnecessary partition scans
Improved query performance, especially for large datasets with many partitions
Better resource utilization across the cluster
Addresses a fundamental limitation in Hive's partitioning model, which cannot efficiently handle high-cardinality columns

Implementation Considerations

Need to aggregate min/max statistics at the partition level for non-partition columns
Consider storage and maintenance overhead for these additional statistics
Ensure statistics are updated appropriately during data modifications
Maintain compatibility with existing Hive metadata and partitioning schemes

Relation to Hive Partitioning Limitations

In Hive-compatible storage, partitioning is typically limited to low-cardinality columns because:

Each partition value creates a separate directory in the filesystem
Too many partitions (from high-cardinality columns) can overwhelm the filesystem and metadata store
Small files problem occurs when high-cardinality columns create many small partitions

This enhancement would provide some of the performance benefits of partitioning on high-cardinality columns without the associated storage and
management drawbacks. It effectively creates a "virtual partitioning" scheme based on statistics rather than physical directory structures.

This would be particularly valuable for tables with many partitions where each partition contains data with distinct ranges of values in non-
partition columns, allowing users to get partition-level pruning benefits for columns that are impractical to use as actual partition keys.

PawanMahalleTech · 2025-06-01T22:35:34Z

Thanks @raunaqmorarka! Got answer on Trino #begineer channel.

Why this feature cannot be not supported for Hive catalog?

Hive table statistics are not guaranteed to be in sync with the actual data, so we don't use them for predicate pruning to avoid correctness issues.
You will get the same pruning from min/max indexes at parquet/orc reader layer anyway.
If you use iceberg or delta lake instead, those will do additional predicate pruning of files based on min/max stats on non-partitioned columns in the coordinator.

raunaqmorarka closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enhance partition pruning by leveraging min/max statistics from non-partition columns #25896

Enhance partition pruning by leveraging min/max statistics from non-partition columns #25896

PawanMahalleTech commented Jun 1, 2025

PawanMahalleTech commented Jun 1, 2025 •

edited

Loading

Uh oh!

Enhance partition pruning by leveraging min/max statistics from non-partition columns #25896

Enhance partition pruning by leveraging min/max statistics from non-partition columns #25896

Comments

PawanMahalleTech commented Jun 1, 2025

Issue Description:

Problem Statement

Proposed Enhancement

Example Scenario

Benefits

Implementation Considerations

Relation to Hive Partitioning Limitations

PawanMahalleTech commented Jun 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PawanMahalleTech commented Jun 1, 2025 •

edited

Loading