@DrakeLin (Collaborator) commented Jan 23, 2026

🥞 Stacked PR

Use this link to review incremental changes.


  • Add NullBuffer mask parameter to update()
  • Only count masked-in rows for numRecords
  • Only count nulls in masked-in rows for nullCount
  • Filter column by mask before computing min/max
  • Tests for mask behavior with min/max and null counting

This enables deletion vector support where masked-out rows
should not contribute to file statistics.
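To make the mask semantics above concrete, here is a minimal std-only sketch. It is not the kernel's code: `ColumnStats` and `update_masked` are hypothetical names, a plain `&[bool]` stands in for the Arrow `NullBuffer` mask, and a single nullable `i64` column stands in for a `RecordBatch`. It assumes `true` means the row is masked in.

```rust
/// Hypothetical stand-in for the per-file statistics discussed above.
#[derive(Debug, Default, PartialEq)]
struct ColumnStats {
    num_records: u64,
    null_count: u64,
    min: Option<i64>,
    max: Option<i64>,
}

/// Accumulate statistics for one batch of a nullable i64 column.
/// Assumption: `mask[i] == true` means row `i` is masked in;
/// `None` means every row is masked in.
fn update_masked(stats: &mut ColumnStats, column: &[Option<i64>], mask: Option<&[bool]>) {
    if let Some(m) = mask {
        assert_eq!(m.len(), column.len(), "mask and column must have equal length");
    }
    for (i, value) in column.iter().enumerate() {
        // Masked-out rows contribute nothing: no record, no null, no min/max.
        if mask.map_or(false, |m| !m[i]) {
            continue;
        }
        stats.num_records += 1;
        match value {
            None => stats.null_count += 1,
            Some(v) => {
                stats.min = Some(stats.min.map_or(*v, |cur| cur.min(*v)));
                stats.max = Some(stats.max.map_or(*v, |cur| cur.max(*v)));
            }
        }
    }
}
```

Calling `update_masked` once per batch on the same `ColumnStats` merges the results, so a masked-out extreme value (for example a deleted row holding the largest value) never widens the recorded bounds.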

What changes are proposed in this pull request?

How was this change tested?

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 70.96774% with 270 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.28%. Comparing base (d4ecc0a) to head (fc10d58).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/default/stats.rs | 67.76% | 228 Missing and 26 partials ⚠️ |
| kernel/src/transaction/mod.rs | 61.90% | 8 Missing ⚠️ |
| kernel/src/table_configuration.rs | 50.00% | 5 Missing ⚠️ |
| kernel/src/scan/data_skipping/stats_schema.rs | 97.40% | 1 Missing and 1 partial ⚠️ |
| kernel/src/snapshot.rs | 80.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1665      +/-   ##
==========================================
- Coverage   84.65%   84.28%   -0.37%     
==========================================
  Files         123      125       +2     
  Lines       34109    35333    +1224     
  Branches    34109    35333    +1224     
==========================================
+ Hits        28875    29781     +906     
- Misses       3905     4186     +281     
- Partials     1329     1366      +37     

@DrakeLin force-pushed the stack/stats-collector-mask branch 7 times, most recently from ed3c8f9 to 226bff5 (January 23, 2026 23:26)
- Add stats_columns parameter to write_parquet_file trait
- Add stats_schema(), stats_columns(), get_clustering_columns() to Transaction
- Add stats_columns to WriteContext
- Update get_write_context() to take engine parameter
- Add clustering column support to expected_stats_schema()
- Add StatisticsCollector struct with new(), update(), finalize()
- Track numRecords across multiple RecordBatches
- Output StructArray with {numRecords, tightBounds}
- Basic unit tests for single/multiple batches

This is the foundation for full stats collection, adding column-level
stats (nullCount, minValues, maxValues) in subsequent PRs.
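The new/update/finalize lifecycle described in this commit can be sketched roughly as follows. This is a simplified std-only illustration, not the PR's `StatisticsCollector`: `RecordCounter` and `FileStats` are hypothetical names, a row count stands in for a `RecordBatch`, a plain struct stands in for the output `StructArray`, and emitting `tight_bounds: true` is an assumption of the sketch.

```rust
/// Hypothetical summary emitted once all batches have been seen.
#[derive(Debug, PartialEq)]
struct FileStats {
    num_records: u64,
    tight_bounds: bool,
}

/// Simplified stand-in for a collector with new/update/finalize.
struct RecordCounter {
    num_records: u64,
}

impl RecordCounter {
    fn new() -> Self {
        Self { num_records: 0 }
    }

    /// Record one batch; at this stage only its row count matters.
    fn update(&mut self, batch_rows: usize) {
        self.num_records += batch_rows as u64;
    }

    /// Consume the collector and produce the file-level summary.
    fn finalize(self) -> FileStats {
        FileStats {
            num_records: self.num_records,
            // Assumption for this sketch: stats computed over all written rows are exact.
            tight_bounds: true,
        }
    }
}
```

Taking `self` by value in `finalize` means the collector cannot be updated after its summary has been produced, which is one way to keep the lifecycle explicit.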
- Add null count tracking for all columns
- Support nested struct null counts
- Merge null counts across multiple batches
- Only collect for columns in stats_columns
- Tests for null counting across batches
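As an illustration of merging null counts across batches while honoring a column allow-list, here is a hedged std-only sketch. `NullCounts` is a hypothetical type; representing nested struct fields as dotted paths such as `user.age` and batches as `(path, values)` pairs are assumptions of this example, not the PR's data model.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical null-count accumulator keyed by column path.
/// Nested struct fields are modeled with dotted paths such as "user.age".
struct NullCounts {
    stats_columns: BTreeSet<String>,
    counts: BTreeMap<String, u64>,
}

impl NullCounts {
    fn new(stats_columns: &[&str]) -> Self {
        Self {
            stats_columns: stats_columns.iter().map(|c| c.to_string()).collect(),
            counts: BTreeMap::new(),
        }
    }

    /// Add one batch, given as (column path, values) pairs.
    /// Columns not listed in `stats_columns` are ignored.
    fn update(&mut self, batch: &[(&str, Vec<Option<i64>>)]) {
        for (path, values) in batch {
            if !self.stats_columns.contains(*path) {
                continue;
            }
            let nulls = values.iter().filter(|v| v.is_none()).count() as u64;
            // Merging across batches is plain addition per column.
            *self.counts.entry(path.to_string()).or_insert(0) += nulls;
        }
    }
}
```

Because per-column null counts are additive, the order in which batches arrive does not affect the final result.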
- Add min/max tracking for all supported types
- Primitive types (int8-64, uint8-64, float32/64)
- Date, timestamp with all time units
- Decimal128
- String types with truncation to 32 chars
- Merge min/max across multiple batches
- Tests for min/max across single and multiple batches
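A rough sketch of running min/max tracking and string truncation, under stated assumptions: `MinMax`, `truncate_min`, and `MAX_STAT_CHARS` are hypothetical names, and the example only covers totally ordered types (floats would need separate handling). Only the minimum side of truncation is shown, since a plain prefix is a valid lower bound but not a valid upper bound; how the PR rounds truncated maximums is not stated here.

```rust
/// Hypothetical cap on stored string statistics (the commit mentions 32 chars).
const MAX_STAT_CHARS: usize = 32;

/// Truncate to at most `MAX_STAT_CHARS` characters without splitting a UTF-8 code point.
/// A prefix never sorts after the full string, so it remains a valid lower bound.
fn truncate_min(s: &str) -> &str {
    match s.char_indices().nth(MAX_STAT_CHARS) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

/// Running min/max over the non-null values of one column.
#[derive(Debug, PartialEq)]
struct MinMax<T> {
    min: Option<T>,
    max: Option<T>,
}

impl<T: Ord + Clone> MinMax<T> {
    fn new() -> Self {
        Self { min: None, max: None }
    }

    /// Fold one batch into the running bounds; nulls are skipped.
    fn update(&mut self, batch: &[Option<T>]) {
        for v in batch.iter().flatten() {
            if self.min.as_ref().map_or(true, |m| v < m) {
                self.min = Some(v.clone());
            }
            if self.max.as_ref().map_or(true, |m| v > m) {
                self.max = Some(v.clone());
            }
        }
    }
}
```

Calling `update` once per batch merges bounds across batches; a batch containing only nulls leaves the bounds unchanged.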
- Add NullBuffer mask parameter to update()
- Only count masked-in rows for numRecords
- Only count nulls in masked-in rows for nullCount
- Filter column by mask before computing min/max
- Tests for mask behavior with min/max and null counting

This enables deletion vector support where masked-out rows
should not contribute to file statistics.
@DrakeLin force-pushed the stack/stats-collector-mask branch from 226bff5 to fc10d58 (January 23, 2026 23:29)
Labels

breaking-change: Change that requires a major version bump
