@DrakeLin (Collaborator) commented Jan 23, 2026

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

  • Adds minValues and maxValues statistics to collect_stats
  • Uses Arrow compute kernels (min, max, min_string, max_string) for efficient aggregation
  • Supports nested structs with recursive min/max computation
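As an illustration of the recursive min/max idea in the bullets above, here is a minimal sketch in plain Rust. It does not use Arrow and is not the code in this PR: `Column`, `MinMax`, and `min_max` are hypothetical names, and a simple loop stands in for the Arrow compute kernels. It assumes nulls are skipped and that an all-null leaf yields no min or max.

```rust
use std::collections::BTreeMap;

/// Toy column model (hypothetical): a nullable i64 leaf or a struct of named children.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<Option<i64>>),
    Struct(BTreeMap<String, Column>),
}

/// Min/max result that mirrors the shape of the input column.
#[derive(Debug, Clone, PartialEq)]
enum MinMax {
    Leaf { min: Option<i64>, max: Option<i64> },
    Struct(BTreeMap<String, MinMax>),
}

/// Computes min/max for a leaf, or recurses into each child of a struct.
fn min_max(col: &Column) -> MinMax {
    match col {
        Column::Int(values) => {
            let mut min: Option<i64> = None;
            let mut max: Option<i64> = None;
            // `flatten` skips nulls (assumption: nulls do not affect bounds).
            for &v in values.iter().flatten() {
                min = Some(min.map_or(v, |m| m.min(v)));
                max = Some(max.map_or(v, |m| m.max(v)));
            }
            MinMax::Leaf { min, max }
        }
        Column::Struct(fields) => MinMax::Struct(
            fields
                .iter()
                .map(|(name, child)| (name.clone(), min_max(child)))
                .collect(),
        ),
    }
}
```

The result keeps the same nesting as the input, so each leaf's min and max can be looked up by the same field path as the data column.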

How was this change tested?

New unit tests.

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 86.63854% with 95 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.66%. Comparing base (d4ecc0a) to head (ec7ddd7).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/engine/default/stats.rs | 82.76% | 64 Missing and 17 partials ⚠️ |
| kernel/src/table_configuration.rs | 50.00% | 5 Missing ⚠️ |
| kernel/src/transaction/mod.rs | 90.00% | 2 Missing and 2 partials ⚠️ |
| kernel/src/engine/default/parquet.rs | 97.53% | 0 Missing and 2 partials ⚠️ |
| kernel/src/scan/data_skipping/stats_schema.rs | 97.40% | 1 Missing and 1 partial ⚠️ |
| kernel/src/snapshot.rs | 80.00% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1664      +/-   ##
==========================================
+ Coverage   84.65%   84.66%   +0.01%     
==========================================
  Files         123      126       +3     
  Lines       34109    35330    +1221     
  Branches    34109    35330    +1221     
==========================================
+ Hits        28875    29913    +1038     
- Misses       3905     4046     +141     
- Partials     1329     1371      +42     

@DrakeLin DrakeLin force-pushed the stack/stats-collector-minmax branch 7 times, most recently from e3e39b6 to 5f7997f Compare January 23, 2026 23:26
@DrakeLin DrakeLin force-pushed the stack/stats-collector-minmax branch from 5f7997f to 628dafd Compare January 23, 2026 23:29
- Add stats_columns parameter to write_parquet_file trait
- Add stats_schema(), stats_columns(), get_clustering_columns() to Transaction
- Add stats_columns to WriteContext
- Update get_write_context() to take engine parameter
- Add clustering column support to expected_stats_schema()
- Add StatisticsCollector struct with new(), update(), finalize()
- Track numRecords across multiple RecordBatches
- Output StructArray with {numRecords, tightBounds}
- Basic unit tests for single/multiple batches

This is the foundation for full stats collection, adding column-level
stats (nullCount, minValues, maxValues) in subsequent PRs.
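The new()/update()/finalize() lifecycle described in this commit message can be sketched roughly as follows. This is a plain-Rust stand-in, not the PR's implementation: the method signatures, the `Stats` struct, and the use of a slice in place of a RecordBatch are all assumptions, and `tight_bounds` is hard-coded to `true` on the assumption that a freshly written file has no deletion vectors.

```rust
/// Accumulates file-level statistics across batches (hypothetical shape).
#[derive(Debug, Default)]
struct StatisticsCollector {
    num_records: u64,
}

/// Plain-struct stand-in for the {numRecords, tightBounds} output.
#[derive(Debug, PartialEq)]
struct Stats {
    num_records: u64,
    tight_bounds: bool,
}

impl StatisticsCollector {
    /// Starts with zero records seen.
    fn new() -> Self {
        Self::default()
    }

    /// Folds one batch into the running totals; a slice stands in for a RecordBatch.
    fn update(&mut self, batch: &[Option<i64>]) {
        // Null rows still count as records; only the row count matters here.
        self.num_records += batch.len() as u64;
    }

    /// Consumes the collector and returns the accumulated statistics.
    fn finalize(self) -> Stats {
        Stats {
            num_records: self.num_records,
            // Assumption: stats for a newly written file are exact.
            tight_bounds: true,
        }
    }
}
```

Having `finalize` take `self` by value means a collector cannot be updated after its stats have been emitted, which is one way to keep the per-file totals from being reused by accident.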
@DrakeLin DrakeLin force-pushed the stack/stats-collector-minmax branch from 628dafd to 12cd31d Compare January 25, 2026 09:14
@DrakeLin DrakeLin force-pushed the stack/stats-collector-minmax branch from 12cd31d to ec7ddd7 Compare January 25, 2026 09:33
@DrakeLin DrakeLin requested review from dengsh12 and nicklan January 25, 2026 23:49
@DrakeLin DrakeLin marked this pull request as ready for review January 25, 2026 23:49

Labels

breaking-change: Change that requires a major version bump

1 participant