feat: read parsed-stats from checkpoint #1638

DrakeLin · 2026-01-20T20:29:18Z

🥞 Stacked PR

Use this link to review incremental changes.

- stack/has_compatible_parsed_stats [Files changed]
- stack/read-parsed-stats [Files changed]

What changes are proposed in this pull request?

This PR adds infrastructure to detect when checkpoints have compatible pre-parsed statistics (stats_parsed) that can be used for data skipping without JSON parsing.

Added CheckpointReadInfo struct containing:

- has_stats_parsed: bool - whether the checkpoint has compatible pre-parsed stats
- checkpoint_read_schema: SchemaRef - schema used to read checkpoint files

How was this change tested?

New and existing unit tests

codecov · 2026-01-20T20:34:51Z

Codecov Report

❌ Patch coverage is 81.66667% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.65%. Comparing base (d4ecc0a) to head (0edcfae).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/log_segment.rs	76.31%	15 Missing and 3 partials ⚠️
kernel/src/scan/data_skipping/stats_schema.rs	85.71%	0 Missing and 2 partials ⚠️
kernel/src/scan/mod.rs	86.66%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@           Coverage Diff            @@
##             main    #1638    +/-   ##
========================================
  Coverage   84.65%   84.65%            
========================================
  Files         123      124     +1     
  Lines       34109    34363   +254     
  Branches    34109    34363   +254     
========================================
+ Hits        28875    29091   +216     
- Misses       3905     3939    +34     
- Partials     1329     1333     +4


kernel/src/scan/log_replay.rs

## 🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/1635/files) to review incremental changes.

- [**stack/propagate-nulls**](#1635) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1635/files)]
- [stack/nullable-transform](#1636) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1636/files/b3b511b3b11aada328cf613b92bc851b3095efaf..0568a9b8b1ffaa6d6128fd05abff3f6820e1fd76)]
- [stack/has_compatible_parsed_stats](#1638) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1638/files/0568a9b8b1ffaa6d6128fd05abff3f6820e1fd76..f840878ae0ca1db97d51512dca0ba7dff1b8371e)]
- [stack/read-parsed-stats](#1639) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1639/files/f840878ae0ca1db97d51512dca0ba7dff1b8371e..f0e4680cab79fb7d6878cda709fc55119449d680)]

---------

## What changes are proposed in this pull request?

Change apply_schema to propagate top-level struct nulls to child columns instead of erroring

- Remove the error check for top-level nulls in apply_schema
- Document that child columns are expected to already have nulls propagated (Arrow's JSON reader does this automatically, and parquet data goes through fix_nested_null_masks)
- Add comprehensive test case test_apply_schema_handles_top_level_null

## How was this change tested?

Edited unit tests

Added unit test to show new behavior

## 🥞 Stacked PR

Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/1636/files) to review incremental changes.

- [**stack/nullable-transform**](#1636) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1636/files)]
- [stack/has_compatible_parsed_stats](#1638) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1638/files/16e428c2fe7a256de0ddb852a11475d7ea131769..a9b9b115b2dd7289447b94bf7646872ee049fec6)]
- [stack/read-parsed-stats](#1639) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1639/files/a9b9b115b2dd7289447b94bf7646872ee049fec6..dff4b57e26b9819a18b29d85d2860283c8ad04c0)]

---------

## What changes are proposed in this pull request?

This PR consolidates duplicated NullableStatsTransform and NullCountStatsTransform implementations into a single shared location.

## How was this change tested?

dengsh12 · 2026-01-21T17:55:59Z

kernel/src/log_segment.rs

+/// Information about checkpoint reading for data skipping optimization.
+///
+/// Returned alongside the actions iterator from checkpoint reading functions.
+#[derive(Debug, Clone)]
+pub(crate) struct CheckpointReadInfo {
+    /// Whether the checkpoint has compatible pre-parsed stats for data skipping.
+    /// When `true`, checkpoint batches can use stats_parsed directly instead of parsing JSON.
+    #[allow(unused)]
+    pub has_stats_parsed: bool,
+    /// The schema used to read checkpoint files, potentially including stats_parsed.
+    #[allow(unused)]
+    pub checkpoint_read_schema: SchemaRef,
+}
+


I wonder whether we really want to store has_stats_parsed separately here. Could we add a method to compute it instead? E.g. something like:

pub fn has_stats_parsed(&self) -> bool {
    self.checkpoint_read_schema
        .field("add")
        .and_then(|add| add.data_type().as_struct())
        .is_some_and(|s| s.field("stats_parsed").is_some())
}

If we store it separately, then whenever we update checkpoint_read_schema we need to remember to update has_stats_parsed as well.

I think we can leave that as a followup if we think it's necessary. I'd rather not recompute it in the DataSkipping module; it seems like a waste if we already know here whether we have stats_parsed. It is a valid concern though!

Make sense! SGTM on this then.
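For reference, the derived-accessor idea from this thread can be sketched in isolation. This is a minimal sketch, not the kernel's code: `DataType`, `Field`, and the `child` helper below are simplified, hypothetical stand-ins for the kernel's schema types, and the check only tests for the presence of `add.stats_parsed`. It does not model the compatibility check that the stored `has_stats_parsed` flag is described as capturing.

```rust
use std::sync::Arc;

/// Hypothetical, simplified stand-in for the kernel's data types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Long,
    String,
    Struct(Vec<Field>),
}

/// Hypothetical, simplified stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
}

impl Field {
    pub fn new(name: &str, data_type: DataType) -> Self {
        Field { name: name.to_string(), data_type }
    }
}

/// Look up a direct child field by name (illustrative helper).
fn child<'a>(fields: &'a [Field], name: &str) -> Option<&'a Field> {
    fields.iter().find(|f| f.name == name)
}

/// Variant of the struct that stores only the schema.
#[derive(Debug, Clone)]
pub struct CheckpointReadInfo {
    pub checkpoint_read_schema: Arc<Vec<Field>>,
}

impl CheckpointReadInfo {
    /// Derived from the schema on every call, so it cannot drift out of
    /// sync with `checkpoint_read_schema`. Presence check only.
    pub fn has_stats_parsed(&self) -> bool {
        child(&self.checkpoint_read_schema, "add")
            .and_then(|add| match &add.data_type {
                DataType::Struct(children) => Some(children),
                _ => None,
            })
            .is_some_and(|children| child(children, "stats_parsed").is_some())
    }
}
```

The trade-off discussed above still applies: the derived form walks the schema on each call, while the stored boolean is computed once but must be kept consistent by hand.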

dengsh12 · 2026-01-21T23:25:43Z

kernel/src/log_segment.rs

                    Some(true) => {
                        // Hint says V2 checkpoint, extract sidecars
                        let sidecar_files = self.extract_sidecar_refs(engine, checkpoint)?;
-                        // For V2, read first sidecar's schema
+                        // For V2, read first sidecar's schema if sidecars exist,
+                        // otherwise fall back to hint schema (for empty V2 checkpoints)
                        let file_actions_schema = match sidecar_files.first() {
                            Some(first) => {
                                Some(engine.parquet_handler().read_parquet_footer(first)?.schema)
                            }
-                            None => None,
+                            None => hint_schema.cloned(),
                        };
                        Ok((file_actions_schema, sidecar_files))
                    }
                    None => {
                        // No hint, need to read parquet footer
                        let footer = engine
                            .parquet_handler()
                            .read_parquet_footer(&checkpoint.location)?;

                        if footer.schema.field(SIDECAR_NAME).is_some() {
                            // V2 parquet checkpoint
                            let sidecar_files = self.extract_sidecar_refs(engine, checkpoint)?;
+                            // For V2, read first sidecar's schema if sidecars exist,
+                            // otherwise fall back to footer schema (for empty V2 checkpoints)
                            let file_actions_schema = match sidecar_files.first() {
                                Some(first) => Some(
                                    engine.parquet_handler().read_parquet_footer(first)?.schema,
                                ),
-                                None => None,
+                                None => Some(footer.schema),
                            };


Why do we change these two return values? Is there a specific reason? I think for empty V2 checkpoints, neither the footer schema nor hint_schema will contain the action schema, so returning footer.schema or hint_schema as file_actions_schema seems like a small mix-up.

So I actually caught this in a test. We can have a V2 checkpoint with the following:

- Has sidecar columns
- Has no sidecars
- Has add actions

In that case, we should just return the V2 checkpoint manifest schema.

Added some comments

Oh yeah, I just double-checked the protocol, and it's a valid case:

| Note: A V2 spec Checkpoint can either have all the add and remove file actions embedded inside itself or all of them should be in sidecar files.

Since it's valid for all add/remove actions to live inside the checkpoint, the checkpoint may not be empty. A nit: change the original comment "otherwise fall back to footer schema (for empty V2 checkpoints)" to just "otherwise fall back to footer schema".
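The fallback agreed on in this thread can be sketched on its own. This is a minimal sketch under stated assumptions: `Schema` is a hypothetical stand-in (a list of top-level column names), `read_footer` is a hypothetical callback replacing the engine's parquet footer read, and `manifest_schema` stands for whichever schema describes the checkpoint file itself (the hint schema or the footer schema in the diff above).

```rust
/// Hypothetical stand-in for a parquet schema: top-level column names only.
type Schema = Vec<String>;

/// Choose the schema that describes the file actions of a V2 checkpoint.
///
/// If sidecars exist, file actions live in them, so the first sidecar's
/// footer is read. If there are none, the actions (if any) are embedded in
/// the checkpoint itself, so the manifest schema is used and no footer is read.
fn file_actions_schema<F>(
    sidecar_files: &[String],
    manifest_schema: &Schema,
    read_footer: F,
) -> Result<Schema, String>
where
    F: Fn(&str) -> Result<Schema, String>,
{
    match sidecar_files.first() {
        Some(first) => read_footer(first),
        None => Ok(manifest_schema.clone()),
    }
}
```

Returning the manifest schema in the `None` arm covers both the truly empty checkpoint and the case where add/remove actions are embedded directly in the checkpoint file.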

nicklan

generally looks reasonable. had a few detail comments

kernel/src/log_segment.rs

nicklan · 2026-01-22T20:44:04Z

kernel/src/log_segment.rs

-            let DataType::Struct(values_struct) = values_field.data_type() else {
+            let DataType::Struct(checkpoint_values) = checkpoint_values_field.data_type() else {
                debug!(
                    "stats_parsed not compatible: stats_parsed.{} is not a Struct, got {:?}",


nit:

Suggested change

"stats_parsed not compatible: stats_parsed.{} is not a Struct, got {:?}",

"stats_parsed not compatible: stats_parsed. {} is not a Struct, got {:?}",

kernel/src/log_segment.rs

nicklan · 2026-01-22T20:50:28Z

kernel/src/log_segment.rs

+                .fields()
+                .map(|f| {
+                    if f.name() == "add" {
+                        new_add_field.clone()


don't think you should need to clone this

Doesn't work without it, something about .map() not knowing if there are multiple add fields in a schema or not
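Some background on why the clone is hard to avoid: `Iterator::map` takes an `FnMut` closure that may run once per element, so the compiler rejects moving a captured value out of it, even if only one element can match at runtime. The sketch below uses a hypothetical `Field` type (not the kernel's) to show the clone-based form and one possible alternative using `Option::take`; whether the alternative is worth it in the actual code is a judgment call.

```rust
/// Hypothetical, simplified field type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    version: u32,
}

/// Clone-based replacement: works however many "add" fields exist.
fn replace_add_with_clone(fields: &[Field], new_add: &Field) -> Vec<Field> {
    fields
        .iter()
        .map(|f| if f.name == "add" { new_add.clone() } else { f.clone() })
        .collect()
}

/// Alternative: wrap the owned value in an Option and `take()` it.
/// The first "add" receives the moved value; any later "add" field
/// falls back to a clone of the original field.
fn replace_add_with_take(fields: &[Field], new_add: Field) -> Vec<Field> {
    let mut new_add = Some(new_add);
    fields
        .iter()
        .map(|f| {
            if f.name == "add" {
                new_add.take().unwrap_or_else(|| f.clone())
            } else {
                f.clone()
            }
        })
        .collect()
}
```

The `take` form avoids one clone but changes behavior if a schema ever contains more than one `add` field, so the plain clone is the simpler choice.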

dengsh12

LGTM, assuming the new comments are resolved

kernel/src/log_segment.rs

dengsh12 · 2026-01-23T19:32:33Z

kernel/src/log_segment.rs

+        impl Iterator<Item = DeltaResult<ActionsBatch>> + Send,
+        CheckpointReadInfo,


May want to use ActionsWithCheckpointInfo here?

This was referenced Jan 20, 2026

refactor: consolidate nullable stat transforms #1636

Merged

feat: Enable Arrow to convert nullable StructArray to RecordBatch #1635

Merged

github-actions bot assigned DrakeLin Jan 20, 2026

github-actions bot added the breaking-change Change that require a major version bump label Jan 20, 2026

DrakeLin force-pushed the stack/has_compatible_parsed_stats branch from a73cb03 to 28e1054 Compare January 20, 2026 20:55

DrakeLin mentioned this pull request Jan 20, 2026

feat: integrate parsed stats with data skipping #1639

Open

DrakeLin force-pushed the stack/has_compatible_parsed_stats branch 3 times, most recently from e92ca69 to ebc62cd Compare January 20, 2026 21:14

DrakeLin changed the title ~~read_stats~~ feat: read parsed-stats from checkpoint Jan 20, 2026

DrakeLin requested review from dengsh12 and nicklan January 20, 2026 21:27

DrakeLin marked this pull request as ready for review January 20, 2026 21:27

DrakeLin force-pushed the stack/has_compatible_parsed_stats branch 2 times, most recently from cea3c64 to f840878 Compare January 20, 2026 22:32

dengsh12 reviewed Jan 21, 2026

DrakeLin force-pushed the stack/has_compatible_parsed_stats branch from f840878 to a9b9b11 Compare January 21, 2026 20:10

DrakeLin force-pushed the stack/has_compatible_parsed_stats branch 2 times, most recently from 62144e8 to 377105a Compare January 21, 2026 22:22

DrakeLin requested a review from dengsh12 January 21, 2026 23:12

dengsh12 reviewed Jan 22, 2026

nicklan reviewed Jan 22, 2026

DrakeLin force-pushed the stack/has_compatible_parsed_stats branch from 377105a to 9619ed9 Compare January 23, 2026 05:09

DrakeLin requested review from dengsh12 and nicklan January 23, 2026 05:10

has_compat

fdd657e

DrakeLin force-pushed the stack/has_compatible_parsed_stats branch from 9619ed9 to fdd657e Compare January 23, 2026 06:10

dengsh12 approved these changes Jan 23, 2026

nits

0edcfae