@DrakeLin DrakeLin commented Jan 23, 2026

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

  • Implements a collect_stats(batch, stats_columns) function that returns a StructArray of basic statistics
  • Currently returns numRecords (row count) and tightBounds (always true for new writes)

This is the foundation for statistics collection, with nullCount and minValues/maxValues to be added in subsequent PRs.
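
A minimal sketch of the shape described above, not the kernel's actual code. It assumes a hypothetical `Batch` type standing in for an Arrow RecordBatch and a hypothetical `BasicStats` struct standing in for the returned StructArray. The `stats_columns` argument is accepted but unused here, since column-level stats are deferred to later PRs.

```rust
/// Hypothetical stand-in for an Arrow `RecordBatch`: named columns of equal length.
pub struct Batch {
    pub columns: Vec<(String, Vec<Option<i64>>)>,
}

impl Batch {
    /// Row count, taken from the first column (0 for a batch with no columns).
    pub fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, values)| values.len())
    }
}

/// Hypothetical stand-in for the returned `StructArray` with its two fields.
#[derive(Debug, PartialEq)]
pub struct BasicStats {
    /// Corresponds to `numRecords`: the number of rows in the batch.
    pub num_records: i64,
    /// Corresponds to `tightBounds`: always true for new writes.
    pub tight_bounds: bool,
}

/// Sketch of `collect_stats(batch, stats_columns)`. The column list is ignored
/// for now because only batch-level stats are produced.
pub fn collect_stats(batch: &Batch, _stats_columns: &[String]) -> BasicStats {
    BasicStats {
        num_records: batch.num_rows() as i64,
        tight_bounds: true,
    }
}
```

Calling `collect_stats` on a batch with one three-row column yields `num_records = 3` and `tight_bounds = true`, regardless of how many values in that column are null.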

How was this change tested?

New unit tests

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 94.79167% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.71%. Comparing base (d4ecc0a) to head (e361bff).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/table_configuration.rs | 50.00% | 5 Missing ⚠️ |
| kernel/src/transaction/mod.rs | 90.00% | 2 Missing and 2 partials ⚠️ |
| kernel/src/engine/default/parquet.rs | 97.53% | 0 Missing and 2 partials ⚠️ |
| kernel/src/scan/data_skipping/stats_schema.rs | 97.40% | 1 Missing and 1 partial ⚠️ |
| kernel/src/engine/default/stats.rs | 97.87% | 0 Missing and 1 partial ⚠️ |
| kernel/src/snapshot.rs | 80.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1662      +/-   ##
==========================================
+ Coverage   84.65%   84.71%   +0.06%     
==========================================
  Files         123      126       +3     
  Lines       34109    34907     +798     
  Branches    34109    34907     +798     
==========================================
+ Hits        28875    29573     +698     
- Misses       3905     3981      +76     
- Partials     1329     1353      +24     

@DrakeLin DrakeLin marked this pull request as ready for review January 23, 2026 08:06
@DrakeLin DrakeLin force-pushed the stack/stats-collector-core branch 5 times, most recently from ee9ac0c to 8096eb1 Compare January 23, 2026 23:26
@DrakeLin DrakeLin force-pushed the stack/stats-collector-core branch 2 times, most recently from 2942bec to c02a485 Compare January 23, 2026 23:58
- Add stats_columns parameter to write_parquet_file trait
- Add stats_schema(), stats_columns(), get_clustering_columns() to Transaction
- Add stats_columns to WriteContext
- Update get_write_context() to take engine parameter
- Add clustering column support to expected_stats_schema()
@DrakeLin DrakeLin force-pushed the stack/stats-collector-core branch from c02a485 to 1ce48c3 Compare January 24, 2026 01:12
@DrakeLin DrakeLin marked this pull request as draft January 24, 2026 01:14
@DrakeLin DrakeLin force-pushed the stack/stats-collector-core branch from 1ce48c3 to 5c2a74d Compare January 24, 2026 03:31
- Add StatisticsCollector struct with new(), update(), finalize()
- Track numRecords across multiple RecordBatches
- Output StructArray with {numRecords, tightBounds}
- Basic unit tests for single/multiple batches

This is the foundation for full stats collection, adding column-level
stats (nullCount, minValues, maxValues) in subsequent PRs.
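
The accumulator described in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged code: `update()` takes a plain row count instead of a RecordBatch, and the hypothetical `FileStats` struct replaces the StructArray output.

```rust
/// Hypothetical output type; the commit describes a StructArray with
/// `{numRecords, tightBounds}` instead.
#[derive(Debug, PartialEq)]
pub struct FileStats {
    pub num_records: i64,
    pub tight_bounds: bool,
}

/// Accumulates statistics across the batches written to one file.
#[derive(Debug, Default)]
pub struct StatisticsCollector {
    num_records: i64,
}

impl StatisticsCollector {
    /// Start with zero records seen.
    pub fn new() -> Self {
        Self::default()
    }

    /// Fold one batch into the running total. Assumption: the caller passes
    /// the batch's row count; the real method presumably takes the batch itself.
    pub fn update(&mut self, batch_num_rows: usize) {
        self.num_records += batch_num_rows as i64;
    }

    /// Consume the collector and emit the final stats.
    /// `tight_bounds` is always true for newly written files.
    pub fn finalize(self) -> FileStats {
        FileStats {
            num_records: self.num_records,
            tight_bounds: true,
        }
    }
}
```

Because `finalize()` takes `self` by value, the collector cannot be updated after its stats are emitted, so each file gets one collector and one final record.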
@DrakeLin DrakeLin force-pushed the stack/stats-collector-core branch from 5c2a74d to e361bff Compare January 24, 2026 03:46
@DrakeLin DrakeLin requested a review from nicklan January 25, 2026 09:59
@DrakeLin DrakeLin marked this pull request as ready for review January 25, 2026 23:48
@DrakeLin DrakeLin requested a review from dengsh12 January 25, 2026 23:49
Labels

breaking-change: Change that requires a major version bump
