@DrakeLin DrakeLin commented Jan 23, 2026

🥞 Stacked PR

Use this link to review incremental changes.

What changes are proposed in this pull request?

  • Add stats validation in kernel core to verify collected statistics before commit.
  • Add StatsVerifier in the default engine, which inspects the StructArray stats structure.
  • Add StatsValidationVisitor in transaction, using the RowVisitor pattern to validate stats in EngineData.
  • Integrate validation into commit() so that all add file metadata is validated automatically.
  • Add add_files_validated() helper for explicit validation during adds.
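
For reviewers who want a concrete picture of "validate before commit", here is a minimal, std-only sketch. It does not use the kernel's EngineData or RowVisitor types. `AddFileStats`, `check_stats`, and `validate_before_commit` are hypothetical names, and the three checks (numRecords present and non-negative, nullCount within 0..=numRecords, min not greater than max) are assumptions about what a verifier might reasonably check, not a description of what this PR checks.

```rust
/// Hypothetical flattened view of the stats for one add file (single i64 column).
pub struct AddFileStats {
    pub path: String,
    pub num_records: Option<i64>,
    pub null_count: Option<i64>,
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Collect every problem found in one file's stats (assumed checks, see lead-in).
pub fn check_stats(s: &AddFileStats) -> Result<(), Vec<String>> {
    let mut problems = Vec::new();
    match s.num_records {
        None => problems.push(format!("{}: numRecords is missing", s.path)),
        Some(n) if n < 0 => problems.push(format!("{}: numRecords is negative", s.path)),
        Some(n) => {
            // nullCount can only be range-checked once numRecords is known to be valid.
            if let Some(nc) = s.null_count {
                if nc < 0 || nc > n {
                    problems.push(format!("{}: nullCount {} outside 0..={}", s.path, nc, n));
                }
            }
        }
    }
    if let (Some(lo), Some(hi)) = (s.min, s.max) {
        if lo > hi {
            problems.push(format!("{}: min {} is greater than max {}", s.path, lo, hi));
        }
    }
    if problems.is_empty() { Ok(()) } else { Err(problems) }
}

/// Check every file and report all problems at once, so a commit can be rejected as a whole.
pub fn validate_before_commit(files: &[AddFileStats]) -> Result<(), Vec<String>> {
    let problems: Vec<String> = files
        .iter()
        .filter_map(|f| check_stats(f).err())
        .flatten()
        .collect();
    if problems.is_empty() { Ok(()) } else { Err(problems) }
}
```

Returning every problem rather than stopping at the first is a design choice of this sketch; it makes a rejected commit easier to debug.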

How was this change tested?

Unit tests

- Add stats_columns parameter to write_parquet_file trait
- Add stats_schema(), stats_columns(), get_clustering_columns() to Transaction
- Add stats_columns to WriteContext
- Update get_write_context() to take engine parameter
- Add clustering column support to expected_stats_schema()
@DrakeLin DrakeLin force-pushed the stack/stats-validation branch from fc4424d to df51202 Compare January 23, 2026 05:24
- Add StatisticsCollector struct with new(), update(), finalize()
- Track numRecords across multiple RecordBatches
- Output StructArray with {numRecords, tightBounds}
- Basic unit tests for single/multiple batches

This is the foundation for full stats collection, adding column-level
stats (nullCount, minValues, maxValues) in subsequent PRs.
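
The new()/update()/finalize() lifecycle described above can be sketched without Arrow as follows. `Batch` and `FileStats` are hypothetical stand-ins for RecordBatch and the output StructArray. Setting `tight_bounds` to `true` unconditionally is an assumption made for the sketch.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: only the row count matters here.
pub struct Batch {
    pub num_rows: usize,
}

/// Hypothetical stand-in for the {numRecords, tightBounds} output struct.
#[derive(Debug, PartialEq)]
pub struct FileStats {
    pub num_records: u64,
    pub tight_bounds: bool,
}

/// Accumulates statistics over all batches written to one file.
#[derive(Default)]
pub struct StatisticsCollector {
    num_records: u64,
}

impl StatisticsCollector {
    pub fn new() -> Self {
        Self::default()
    }

    /// Fold one batch into the running totals.
    pub fn update(&mut self, batch: &Batch) {
        self.num_records += batch.num_rows as u64;
    }

    /// Consume the collector and emit the file-level stats.
    pub fn finalize(self) -> FileStats {
        FileStats {
            num_records: self.num_records,
            // Assumption for this sketch: bounds are always reported as tight.
            tight_bounds: true,
        }
    }
}
```

Calling `update()` once per batch and `finalize()` once per file means the collector never has to hold more than the running totals in memory.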
- Add null count tracking for all columns
- Support nested struct null counts
- Merge null counts across multiple batches
- Only collect for columns in stats_columns
- Tests for null counting across batches
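
A rough illustration of merging null counts across batches while skipping columns outside stats_columns. This is a std-only sketch: `Batch` (a map from column name to nullable i64 values) and `NullCounter` are hypothetical, and nested struct columns are not covered.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical batch: column name -> nullable values (i64 only, for brevity).
pub type Batch = BTreeMap<String, Vec<Option<i64>>>;

pub struct NullCounter {
    stats_columns: HashSet<String>,
    null_counts: BTreeMap<String, u64>,
}

impl NullCounter {
    pub fn new(stats_columns: &[&str]) -> Self {
        Self {
            stats_columns: stats_columns.iter().map(|c| c.to_string()).collect(),
            null_counts: BTreeMap::new(),
        }
    }

    /// Add this batch's null counts to the running per-column totals.
    pub fn update(&mut self, batch: &Batch) {
        for (name, values) in batch {
            // Columns that are not in stats_columns are skipped entirely.
            if !self.stats_columns.contains(name) {
                continue;
            }
            let nulls = values.iter().filter(|v| v.is_none()).count() as u64;
            *self.null_counts.entry(name.clone()).or_insert(0) += nulls;
        }
    }

    /// Per-column null counts, summed over every batch seen.
    pub fn finalize(self) -> BTreeMap<String, u64> {
        self.null_counts
    }
}
```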
- Add min/max tracking for all supported types
- Primitive types (int8-64, uint8-64, float32/64)
- Date, timestamp with all time units
- Decimal128
- String types with truncation to 32 chars
- Merge min/max across multiple batches
- Tests for min/max across single and multiple batches
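
A minimal sketch of the min/max merge and of string truncation for the minimum, assuming values arrive as slices of `Option<T>` rather than Arrow arrays. `MinMax` and `truncate_min` are hypothetical names. Only the min side of truncation is shown: a prefix always sorts at or before the original string, so it stays a valid lower bound, whereas a truncated max needs extra handling to remain an upper bound, which this sketch omits. Floating-point NaN handling is also out of scope here.

```rust
/// Running min/max for one column; both stay `None` until a non-null value is seen.
#[derive(Debug, PartialEq)]
pub struct MinMax<T> {
    pub min: Option<T>,
    pub max: Option<T>,
}

impl<T: PartialOrd + Clone> MinMax<T> {
    pub fn new() -> Self {
        Self { min: None, max: None }
    }

    /// Merge one batch of values into the running bounds, ignoring nulls.
    pub fn update(&mut self, values: &[Option<T>]) {
        for v in values.iter().flatten() {
            if self.min.as_ref().map_or(true, |m| v < m) {
                self.min = Some(v.clone());
            }
            if self.max.as_ref().map_or(true, |m| v > m) {
                self.max = Some(v.clone());
            }
        }
    }
}

/// Truncate a string minimum to at most `max_chars` characters.
/// Counting `char`s (not bytes) avoids splitting a multi-byte UTF-8 sequence.
pub fn truncate_min(s: &str, max_chars: usize) -> String {
    s.chars().take(max_chars).collect()
}
```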
- Add NullBuffer mask parameter to update()
- Only count masked-in rows for numRecords
- Only count nulls in masked-in rows for nullCount
- Filter column by mask before computing min/max
- Tests for mask behavior with min/max and null counting

This enables deletion vector support where masked-out rows
should not contribute to file statistics.
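
To make the mask semantics concrete, here is a std-only sketch in which a `&[bool]` stands in for the NullBuffer mask (`true` means the row is kept). `MaskedStats` and `update_masked` are hypothetical names, and a single i64 column is assumed.

```rust
/// Hypothetical per-column stats that honor a row mask.
#[derive(Debug, Default, PartialEq)]
pub struct MaskedStats {
    pub num_records: u64,
    pub null_count: u64,
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Fold one batch into `stats`. With `mask == None`, every row is counted.
pub fn update_masked(stats: &mut MaskedStats, values: &[Option<i64>], mask: Option<&[bool]>) {
    if let Some(m) = mask {
        assert_eq!(m.len(), values.len(), "mask length must match row count");
    }
    for (i, v) in values.iter().enumerate() {
        // Masked-out rows contribute nothing: no record, no null, no min/max.
        if !mask.map_or(true, |m| m[i]) {
            continue;
        }
        stats.num_records += 1;
        match v {
            None => stats.null_count += 1,
            Some(x) => {
                stats.min = Some(stats.min.map_or(*x, |m| m.min(*x)));
                stats.max = Some(stats.max.map_or(*x, |m| m.max(*x)));
            }
        }
    }
}
```

In the test case below, the masked-out value 100 would otherwise have become the max, which is the situation the mask is meant to prevent.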
- Import StatisticsCollector in parquet.rs
- Add stats field to DataFileMetadata with with_stats() method
- Update as_record_batch to use full stats if available
- Update write_parquet_file to collect and attach stats
- Update mod.rs write_parquet to pass stats_columns
- Update write tests to expect full stats output
- Fix write-table example API call
- Add StatsVerifier with verify() and verify_detailed() methods
- Add StatsValidationVisitor using RowVisitor pattern
- Integrate validation into commit() flow
- Add add_files_validated() and validate_add_files_stats() helpers
- Unit tests for verifier functionality
@DrakeLin DrakeLin force-pushed the stack/stats-validation branch from df51202 to 00ccd2a Compare January 23, 2026 06:28
@DrakeLin DrakeLin closed this Jan 23, 2026
@DrakeLin DrakeLin deleted the stack/stats-validation branch January 23, 2026 06:30

Labels

breaking-change: Change that requires a major version bump