Skip to content

Enhance partition pruning by leveraging min/max statistics from non-partition columns #25896

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
PawanMahalleTech opened this issue Jun 1, 2025 · 1 comment

Comments

@PawanMahalleTech
Copy link

Issue Description:

Problem Statement

Currently, Trino performs partition pruning based on partition column values, and separately performs file-level pruning based on min/max
statistics.

However, there's an opportunity to enhance partition pruning by using min/max statistics from non-partition columns to eliminate
entire partitions when possible.

This is particularly relevant for Hive-compatible tables where high-cardinality columns (like user_id or product_id) are typically not used as
partition columns due to the limitations of the Hive partitioning model, which creates a separate directory for each partition value. File-level pruning need split creation which is an overhead when all files are being skipped from entire partition.

Proposed Enhancement

Enhance the partition pruning mechanism to leverage min/max statistics collected from non-partition columns to skip entire partitions when the
query predicate cannot be satisfied by any file in that partition.

Example Scenario

Consider a table partitioned by date with the following structure:
• Partition column: date (low cardinality, suitable for Hive partitioning)
• Non-partition columns: user_id (high cardinality, not suitable for Hive partitioning), product_id, amount

Current partitions:
• date=2023-01-01 (contains user_ids from 1-1000)
• date=2023-01-02 (contains user_ids from 1001-2000)
• date=2023-01-03 (contains user_ids from 2001-3000)

When executing a query like:
sql
SELECT * FROM orders
WHERE date BETWEEN '2023-01-01' AND '2023-01-03'
AND user_id = 500;

Current behavior:
• Trino scans all three partitions because they match the date predicate
• Then applies file-level pruning within each partition based on user_id statistics

Proposed behavior:
• Trino would recognize that only the partition date=2023-01-01 contains files with user_id=500 (based on min/max statistics)
• Partitions date=2023-01-02 and date=2023-01-03 would be pruned entirely without reading any files

Benefits

  1. Reduced I/O operations by eliminating unnecessary partition scans
  2. Improved query performance, especially for large datasets with many partitions
  3. Better resource utilization across the cluster
  4. Addresses a fundamental limitation in Hive's partitioning model, which cannot efficiently handle high-cardinality columns

Implementation Considerations

  1. Need to aggregate min/max statistics at the partition level for non-partition columns
  2. Consider storage and maintenance overhead for these additional statistics
  3. Ensure statistics are updated appropriately during data modifications
  4. Maintain compatibility with existing Hive metadata and partitioning schemes

Relation to Hive Partitioning Limitations

In Hive-compatible storage, partitioning is typically limited to low-cardinality columns because:

  1. Each partition value creates a separate directory in the filesystem
  2. Too many partitions (from high-cardinality columns) can overwhelm the filesystem and metadata store
  3. Small files problem occurs when high-cardinality columns create many small partitions

This enhancement would provide some of the performance benefits of partitioning on high-cardinality columns without the associated storage and
management drawbacks. It effectively creates a "virtual partitioning" scheme based on statistics rather than physical directory structures.

This would be particularly valuable for tables with many partitions where each partition contains data with distinct ranges of values in non-
partition columns, allowing users to get partition-level pruning benefits for columns that are impractical to use as actual partition keys.

@PawanMahalleTech
Copy link
Author

PawanMahalleTech commented Jun 1, 2025

Thanks @raunaqmorarka! Got answer on Trino #begineer channel.

Why this feature cannot be not supported for Hive catalog?

Hive table statistics are not guaranteed to be in sync with the actual data, so we don't use them for predicate pruning to avoid correctness issues.
You will get the same pruning from min/max indexes at parquet/orc reader layer anyway.
If you use iceberg or delta lake instead, those will do additional predicate pruning of files based on min/max stats on non-partitioned columns in the coordinator.

@raunaqmorarka raunaqmorarka closed this as not planned Won't fix, can't repro, duplicate, stale Jun 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants