Skip to content

Conversation

@DrakeLin
Copy link
Collaborator

@DrakeLin DrakeLin commented Jan 20, 2026

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

This PR enables reading pre-parsed statistics (stats_parsed) directly from checkpoint files for data skipping, avoiding JSON parsing overhead.

DataSkippingFilter Updates

  • Add select_stats_parsed_evaluator for extracting add.stats_parsed from checkpoint batches
  • Update constructor to accept checkpoint_read_schema and has_compatible_stats_parsed parameters
  • Update apply() to take is_log_batch parameter:
  • Commit batches (JSON log files): Always parse JSON stats
  • Checkpoint batches with compatible stats_parsed: Use stats_parsed directly
  • Checkpoint batches without stats_parsed: Parse JSON stats

How was this change tested?

Add new test table tests/data/parsed-stats/ with:

  • 6 data files (100 rows each, ids 1-600)
  • Checkpoint at version 3 with stats_parsed
  • 2 commit files after checkpoint

Integration Tests

  • Add data_skipping_with_parsed_stats test verifying:
  • Skip all files when predicate matches none
  • Skip to single file for highly selective predicates
  • Correct file counts for various predicates on both checkpoint and commit files

@codecov
Copy link

codecov bot commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 89.74359% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.70%. Comparing base (d4ecc0a) to head (5bbf538).

Files with missing lines Patch % Lines
kernel/src/log_segment.rs 87.83% 9 Missing ⚠️
kernel/src/scan/data_skipping.rs 77.50% 3 Missing and 6 partials ⚠️
kernel/src/scan/mod.rs 85.71% 3 Missing and 3 partials ⚠️
kernel/src/scan/log_replay.rs 97.11% 1 Missing and 2 partials ⚠️
kernel/src/parallel/sequential_phase.rs 80.00% 0 Missing and 2 partials ⚠️
kernel/src/scan/data_skipping/stats_schema.rs 85.71% 0 Missing and 2 partials ⚠️
kernel/src/table_changes/log_replay.rs 90.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1639      +/-   ##
==========================================
+ Coverage   84.65%   84.70%   +0.05%     
==========================================
  Files         123      123              
  Lines       34109    34328     +219     
  Branches    34109    34328     +219     
==========================================
+ Hits        28875    29079     +204     
- Misses       3905     3919      +14     
- Partials     1329     1330       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Jan 20, 2026
@DrakeLin DrakeLin force-pushed the stack/read-parsed-stats branch 2 times, most recently from 28acc14 to f0018a5 Compare January 20, 2026 21:14
@DrakeLin DrakeLin changed the title ps feature: fully integrate parsed stats read Jan 20, 2026
@DrakeLin DrakeLin changed the title feature: fully integrate parsed stats read feature: integrate parsed stats with data skipping Jan 20, 2026
@DrakeLin DrakeLin changed the title feature: integrate parsed stats with data skipping feat: integrate parsed stats with data skipping Jan 20, 2026
@DrakeLin DrakeLin force-pushed the stack/read-parsed-stats branch from f0018a5 to a2ebc6b Compare January 20, 2026 21:42
@DrakeLin DrakeLin marked this pull request as ready for review January 20, 2026 21:42
@DrakeLin DrakeLin requested review from dengsh12 and nicklan January 20, 2026 21:42
@DrakeLin DrakeLin force-pushed the stack/read-parsed-stats branch from a2ebc6b to f0e4680 Compare January 20, 2026 22:32
DrakeLin added a commit that referenced this pull request Jan 21, 2026
)

## 🥞 Stacked PR
Use this
[link](https://github.com/delta-io/delta-kernel-rs/pull/1635/files) to
review incremental changes.
-
[**stack/propagate-nulls**](#1635)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1635/files)]
-
[stack/nullable-transform](#1636)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1636/files/b3b511b3b11aada328cf613b92bc851b3095efaf..0568a9b8b1ffaa6d6128fd05abff3f6820e1fd76)]
-
[stack/has_compatible_parsed_stats](#1638)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1638/files/0568a9b8b1ffaa6d6128fd05abff3f6820e1fd76..f840878ae0ca1db97d51512dca0ba7dff1b8371e)]
-
[stack/read-parsed-stats](#1639)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1639/files/f840878ae0ca1db97d51512dca0ba7dff1b8371e..f0e4680cab79fb7d6878cda709fc55119449d680)]

---------
## What changes are proposed in this pull request?
Change apply_schema to propagate top-level struct nulls to child columns
instead of erroring

- Remove the error check for top-level nulls in apply_schema
- Document that child columns are expected to already have nulls
propagated (Arrow's JSON reader does this automatically, and parquet
data goes through fix_nested_null_masks)
- Add comprehensive test case test_apply_schema_handles_top_level_null
## How was this change tested?
Edited unit tests

Added unit test to show new behavior
@DrakeLin DrakeLin force-pushed the stack/read-parsed-stats branch from f0e4680 to dff4b57 Compare January 21, 2026 20:10
DrakeLin added a commit that referenced this pull request Jan 21, 2026
## 🥞 Stacked PR
Use this
[link](https://github.com/delta-io/delta-kernel-rs/pull/1636/files) to
review incremental changes.
-
[**stack/nullable-transform**](#1636)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1636/files)]
-
[stack/has_compatible_parsed_stats](#1638)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1638/files/16e428c2fe7a256de0ddb852a11475d7ea131769..a9b9b115b2dd7289447b94bf7646872ee049fec6)]
-
[stack/read-parsed-stats](#1639)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1639/files/a9b9b115b2dd7289447b94bf7646872ee049fec6..dff4b57e26b9819a18b29d85d2860283c8ad04c0)]

---------
## What changes are proposed in this pull request?

This PR consolidates duplicated NullableStatsTransform and
NullCountStatsTransform implementations into a single shared location.
<!--
**Uncomment** this section if there are any changes affecting public
APIs. Else, **delete** this section.
### This PR affects the following public APIs
If there are breaking changes, please ensure the `breaking-changes`
label gets added by CI, and describe why the changes are needed.
Note that _new_ public APIs are not considered breaking.
-->

## How was this change tested?
@DrakeLin DrakeLin force-pushed the stack/read-parsed-stats branch 3 times, most recently from 5387120 to 0e03b17 Compare January 23, 2026 05:09
@DrakeLin DrakeLin force-pushed the stack/read-parsed-stats branch 5 times, most recently from 5d21016 to 5bbf538 Compare January 23, 2026 07:21
@DrakeLin DrakeLin force-pushed the stack/read-parsed-stats branch from 5bbf538 to 6b96787 Compare January 23, 2026 21:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant