
Commit 092ee67

Utility trait for stats-based skipping logic (#357)
Parquet footer stats allow data skipping, much like Delta file stats. However, parquet is less convenient to work with, and arrow-parquet doesn't even try to help (it can't: arrow-compute expressions are opaque, so there's no way to traverse and rewrite them into stats-based skipping predicates). We implement row group skipping support by traversing the same push-down predicate that delta-kernel already uses to extract a Delta file skipping predicate. But instead of rewriting the expression, we evaluate it bottom-up (no-copy, O(n) work where n is the number of nodes in the expression). This PR does not attempt to incorporate the new skipping logic into the default reader. That (plus testing the integration) should be a follow-up PR.
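As a rough illustration of the bottom-up evaluation described above, here is a minimal sketch. It is not the trait added by this commit: `Pred`, `RowGroupStats`, `eval`, and `should_skip` are hypothetical names, only `i64` columns are handled, and NOT is left out because "may match" results cannot simply be inverted. Each node is visited once and nothing is copied or rewritten; `None` means the stats cannot decide, so the row group must be kept.

```rust
use std::collections::HashMap;

/// Hypothetical min/max footer stats for one row group, keyed by column name.
struct RowGroupStats {
    min_max: HashMap<&'static str, (i64, i64)>,
}

/// Hypothetical, simplified predicate tree (not delta-kernel's expression type).
enum Pred {
    Lt(&'static str, i64),
    Gt(&'static str, i64),
    Eq(&'static str, i64),
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
}

/// Evaluates `pred` bottom-up against the stats.
/// Some(false): no row in the group can match. Some(true): some row may match.
/// None: stats are missing, so nothing can be concluded.
fn eval(pred: &Pred, stats: &RowGroupStats) -> Option<bool> {
    match pred {
        Pred::Lt(col, v) => {
            let (min, _) = stats.min_max.get(col)?;
            Some(*min < *v)
        }
        Pred::Gt(col, v) => {
            let (_, max) = stats.min_max.get(col)?;
            Some(*max > *v)
        }
        Pred::Eq(col, v) => {
            let (min, max) = stats.min_max.get(col)?;
            Some(*min <= *v && *v <= *max)
        }
        // AND is false if either side is definitely false, even if the other is unknown.
        Pred::And(a, b) => match (eval(a, stats), eval(b, stats)) {
            (Some(false), _) | (_, Some(false)) => Some(false),
            (Some(true), Some(true)) => Some(true),
            _ => None,
        },
        // OR is false only if both sides are definitely false.
        Pred::Or(a, b) => match (eval(a, stats), eval(b, stats)) {
            (Some(true), _) | (_, Some(true)) => Some(true),
            (Some(false), Some(false)) => Some(false),
            _ => None,
        },
    }
}

/// A row group can be skipped only when the predicate is provably false for it.
fn should_skip(pred: &Pred, stats: &RowGroupStats) -> bool {
    eval(pred, stats) == Some(false)
}
```

With stats `x in [10, 20]`, a predicate such as `x > 25` evaluates to `Some(false)` and the row group is skipped, while a predicate on a column without stats evaluates to `None` and the row group is read.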
1 parent c81da02 commit 092ee67


9 files changed: +1415 -12 lines changed


kernel/src/engine/mod.rs

Lines changed: 3 additions & 0 deletions
@@ -11,6 +11,9 @@ pub mod arrow_expression;
 #[cfg(any(feature = "default-engine", feature = "sync-engine"))]
 pub mod arrow_data;
 
+#[cfg(any(feature = "default-engine", feature = "sync-engine"))]
+pub mod parquet_stats_skipping;
+
 #[cfg(any(feature = "default-engine", feature = "sync-engine"))]
 pub(crate) mod arrow_get_data;

kernel/src/engine/parquet_stats_skipping.rs

Lines changed: 406 additions & 0 deletions
Large diffs are not rendered by default.
