@DrakeLin DrakeLin commented Jan 21, 2026

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

Adds a new stats_transform module that provides the core logic for populating stats and stats_parsed fields when writing checkpoints. This module builds transform expressions based on table configuration to ensure statistics are properly converted between JSON and struct formats.

| writeStatsAsJson | writeStatsAsStruct | stats | stats_parsed |
|---|---|---|---|
| true | false | `COALESCE(stats, ToJson(stats_parsed))` | drop |
| true | true | `COALESCE(stats, ToJson(stats_parsed))` | `COALESCE(stats_parsed, ParseJson(stats))` |
| false | true | drop | `COALESCE(stats_parsed, ParseJson(stats))` |
| false | false | drop | drop |
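
The table above can be read as two independent decisions, one per column. The following is a minimal sketch of that reading, not the module's actual code: `StatsAction` and `stats_actions` are hypothetical names, and the expressions are represented as plain strings rather than kernel expression types.

```rust
/// What to do with one statistics column when writing a checkpoint.
/// (Hypothetical type, used only for illustration.)
#[derive(Debug, PartialEq)]
enum StatsAction {
    /// Keep the existing value, falling back to a conversion of the other column.
    Coalesce {
        primary: &'static str,
        fallback: &'static str,
    },
    /// Omit the column from the checkpoint.
    Drop,
}

/// Returns the actions for (`stats`, `stats_parsed`) given the two table properties.
fn stats_actions(write_json: bool, write_struct: bool) -> (StatsAction, StatsAction) {
    // `stats` depends only on writeStatsAsJson.
    let stats = if write_json {
        StatsAction::Coalesce {
            primary: "stats",
            fallback: "ToJson(stats_parsed)",
        }
    } else {
        StatsAction::Drop
    };
    // `stats_parsed` depends only on writeStatsAsStruct.
    let stats_parsed = if write_struct {
        StatsAction::Coalesce {
            primary: "stats_parsed",
            fallback: "ParseJson(stats)",
        }
    } else {
        StatsAction::Drop
    };
    (stats, stats_parsed)
}
```

Calling `stats_actions` with each of the four combinations reproduces the four rows of the table; each output column is a function of exactly one property.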

How was this change tested?

Tests are added in the next PR in the stack.

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 78.31325% with 54 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.60%. Comparing base (d4ecc0a) to head (4b77bb4).
⚠️ Report is 9 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| kernel/src/checkpoint/stats_transform.rs | 78.31% | 54 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1646      +/-   ##
==========================================
- Coverage   84.65%   84.60%   -0.06%     
==========================================
  Files         123      124       +1     
  Lines       34109    34358     +249     
  Branches    34109    34358     +249     
==========================================
+ Hits        28875    29067     +192     
- Misses       3905     3960      +55     
- Partials     1329     1331       +2     

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Jan 21, 2026
@DrakeLin DrakeLin force-pushed the stack/checkpoint-transforms branch from af4261b to 0aefe2a Compare January 21, 2026 00:21
@DrakeLin DrakeLin marked this pull request as ready for review January 21, 2026 00:25
@DrakeLin DrakeLin requested review from dengsh12 and nicklan January 21, 2026 00:25
@DrakeLin DrakeLin force-pushed the stack/checkpoint-transforms branch from 0aefe2a to 628c05b Compare January 21, 2026 00:37
@github-actions github-actions bot removed the breaking-change Change that require a major version bump label Jan 21, 2026
@DrakeLin DrakeLin force-pushed the stack/checkpoint-transforms branch from 628c05b to 66c1b43 Compare January 21, 2026 07:24
@DrakeLin DrakeLin force-pushed the stack/checkpoint-transforms branch 2 times, most recently from 55236f5 to eab701e Compare January 21, 2026 23:38
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Jan 21, 2026
@nicklan nicklan left a comment


mostly looks good, just a few small things.

one bigger question, we're doing a lot of clone()ing of fields here. Are the passed schema generally owned such that we could massage them in place, or are they anyway shared schema that we'd have to clone in order to modify? If it's the former, maybe let's make a follow-up to add the needed ability to StructType to insert/remove/modify fields and use that to avoid quite so much copying.

//! Transforms for populating stats_parsed and stats fields in checkpoint data.
//!
//! When writing checkpoints, statistics can be stored in two formats:
//! - `stats`: JSON string format (controlled by `writeStatsAsJson` table property)

we should probably say it's the delta.writeStatsAsJson property right? ditto next line. let's also note the default values

@DrakeLin DrakeLin force-pushed the stack/checkpoint-transforms branch 2 times, most recently from fa1de9f to 4e66ca0 Compare January 23, 2026 03:14
DrakeLin added a commit that referenced this pull request Jan 23, 2026
## 🥞 Stacked PR
Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/1645/files) to review incremental changes.
- [**stack/null-propagation**](#1645) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1645/files)]
- [stack/coalesce](#1648) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1648/files/37e755009566511bf7c2f00e014c1647e77e4533..d64042f7908844ef2d8a1c68312dc3ff936d60dc)]
- [stack/checkpoint-transforms](#1646) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1646/files/d64042f7908844ef2d8a1c68312dc3ff936d60dc..4e66ca004f89b23431a96ac106a9c0d400718b10)]
- [stack/write-stats](#1643) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1643/files/4e66ca004f89b23431a96ac106a9c0d400718b10..cd64f79fd3b40ebfa811cb333369cb17aa1a2a74)]

---------
## What changes are proposed in this pull request?

Fixes a bug in nested transform expression evaluation where null rows in
the source struct were losing their null bitmap, causing null structs to
incorrectly appear as non-null structs with null fields.

When evaluating nested transform expressions (transforms with an
input_path that operate on a nested struct), the output StructArray was
created with None for the null buffer:
`let data = StructArray::try_new(output_fields.into(), output_cols,
None)?;`
This meant that if the source struct had null rows (e.g., an add action
that is null in a checkpoint batch), the output would lose that null
information. The struct would appear as non-null but with all-null
fields, which is semantically different.
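
To make the difference between a null struct and a non-null struct with null fields concrete, here is a small standard-library model of the problem. It is a sketch under stated assumptions and does not use the Arrow API: `StructColumn` and both transform functions are hypothetical stand-ins, with `validity: None` playing the role of a missing null buffer.

```rust
/// Hypothetical stand-in for a struct column: an optional validity bitmap
/// plus a single child column.
#[derive(Debug, Clone, PartialEq)]
struct StructColumn {
    /// `None` means "every row is non-null", mirroring a missing null buffer.
    validity: Option<Vec<bool>>,
    child: Vec<Option<i64>>,
}

impl StructColumn {
    /// A row is null only if a validity bitmap exists and marks it invalid.
    fn is_null(&self, row: usize) -> bool {
        self.validity.as_ref().map_or(false, |v| !v[row])
    }
}

/// Rebuilds the column but discards the validity bitmap
/// (the behaviour described as the bug).
fn transform_dropping_nulls(input: &StructColumn) -> StructColumn {
    StructColumn {
        validity: None,
        child: input.child.clone(),
    }
}

/// Rebuilds the column and carries the source validity bitmap over.
fn transform_preserving_nulls(input: &StructColumn) -> StructColumn {
    StructColumn {
        validity: input.validity.clone(),
        child: input.child.clone(),
    }
}
```

With an input whose second row is null, `transform_dropping_nulls` reports that row as non-null (its child value is still `None`), while `transform_preserving_nulls` keeps it null.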

## How was this change tested?
Existing transform tests pass. The stats transform integration tests (in
a follow-up PR) exercise this code path.
@DrakeLin DrakeLin force-pushed the stack/checkpoint-transforms branch 3 times, most recently from 9848c2d to 0f6c163 Compare January 23, 2026 05:01
@DrakeLin

> mostly looks good, just a few small things.
>
> one bigger question, we're doing a lot of clone()ing of fields here. Are the passed schema generally owned such that we could massage them in place, or are they anyway shared schema that we'd have to clone in order to modify? If it's the former, maybe let's make a follow-up to add the needed ability to StructType to insert/remove/modify fields and use that to avoid quite so much copying.

Copying is necessary; I opened an issue for this: #1657
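
The owned-versus-shared distinction in the question can be illustrated with `std::sync::Arc`. This is only a sketch of the general trade-off, assuming a schema held behind an `Arc`; `Field`, `Schema`, and `drop_field` are hypothetical and are not the kernel's `StructType` API.

```rust
use std::sync::Arc;

/// Hypothetical, minimal field and schema types for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<Field>,
}

/// Removes a field by name. `Arc::make_mut` mutates in place when this is the
/// only handle, and clones the schema first when other handles still exist.
fn drop_field(schema: &mut Arc<Schema>, name: &str) {
    Arc::make_mut(schema).fields.retain(|f| f.name != name);
}
```

If the schema is uniquely owned, no copy happens; if it is shared, the clone is unavoidable, which matches the answer given here.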

@DrakeLin DrakeLin force-pushed the stack/checkpoint-transforms branch from 0f6c163 to 4b77bb4 Compare January 23, 2026 05:04
@DrakeLin DrakeLin requested a review from nicklan January 23, 2026 05:04
//! based on the table configuration. Statistics can be stored in two formats as fields on
//! the `Add` action:
//! - `stats`: JSON string format, controlled by `delta.checkpoint.writeStatsAsJson` (default: true)
//! - `stats_parsed`: Native struct format, controlled by `delta.checkpoint.writeStatsAsStruct` (default: true)

It seems the default is false in the protocol?

> stats_parsed: The stats can be stored in their original format. This field needs to be written when statistics are available and the table property: delta.checkpoint.writeStatsAsStruct is set to true. When this property is set to false (which is the default), this field should be omitted from the checkpoint.

pub(super) fn from_table_properties(properties: &TableProperties) -> Self {
    Self {
        write_stats_as_json: properties.checkpoint_write_stats_as_json.unwrap_or(true),
        write_stats_as_struct: properties.checkpoint_write_stats_as_struct.unwrap_or(true),

ditto on the default as false
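
A sketch of what the suggested default could look like, assuming the protocol text quoted above applies. `TableProperties` here is a trimmed-down stand-in with only the two fields from the snippet, and `StatsTransformConfig` is a hypothetical name for the surrounding type, which the excerpt does not show.

```rust
/// Hypothetical, trimmed-down stand-in for the kernel's table properties.
#[derive(Default)]
struct TableProperties {
    checkpoint_write_stats_as_json: Option<bool>,
    checkpoint_write_stats_as_struct: Option<bool>,
}

/// Hypothetical name for the config type built in the snippet under review.
#[derive(Debug, PartialEq)]
struct StatsTransformConfig {
    write_stats_as_json: bool,
    write_stats_as_struct: bool,
}

impl StatsTransformConfig {
    fn from_table_properties(properties: &TableProperties) -> Self {
        Self {
            // Default kept as `true`, as in the snippet under review.
            write_stats_as_json: properties.checkpoint_write_stats_as_json.unwrap_or(true),
            // Default changed to `false`, following the quoted protocol text.
            write_stats_as_struct: properties
                .checkpoint_write_stats_as_struct
                .unwrap_or(false),
        }
    }
}
```

With neither property set, this yields `write_stats_as_json = true` and `write_stats_as_struct = false`; an explicit property value still overrides either default.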

Comment on lines +312 to +315
// Verify we get a Transform expression
let Expression::Transform(_) = transform_expr.as_ref() else {
    panic!("Expected Transform expression");
};

non-blocking: This test only verifies that the returned expression is a Transform. I wonder if we also want to test the inner content of transform_expr? E.g.

  • Outer transform replaces "add" field
  • Inner transform has field_transforms for "stats" (with COALESCE expr) and "stats_parsed" (dropped)
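
A deeper check along these lines might look like the sketch below. It uses a hypothetical, heavily simplified expression tree (`Expression`, `FieldTransform`, and `Transform` here are illustrative stand-ins, not the kernel's real types), and `build_json_only_transform` only constructs the shape described in the two bullets for the case where JSON stats are written and struct stats are not.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified expression tree; the kernel's real type is richer.
#[derive(Debug)]
enum Expression {
    Column(String),
    Coalesce(Vec<Expression>),
    ToJson(Box<Expression>),
    Transform(Transform),
}

/// Hypothetical per-field action inside a transform.
#[derive(Debug)]
enum FieldTransform {
    Replace(Expression),
    Drop,
}

#[derive(Debug, Default)]
struct Transform {
    field_transforms: HashMap<String, FieldTransform>,
}

/// Builds the shape described above: outer replaces "add", inner handles the stats fields.
fn build_json_only_transform() -> Expression {
    let coalesce = Expression::Coalesce(vec![
        Expression::Column("stats".to_string()),
        Expression::ToJson(Box::new(Expression::Column("stats_parsed".to_string()))),
    ]);
    let mut inner = Transform::default();
    inner
        .field_transforms
        .insert("stats".to_string(), FieldTransform::Replace(coalesce));
    inner
        .field_transforms
        .insert("stats_parsed".to_string(), FieldTransform::Drop);
    let mut outer = Transform::default();
    outer.field_transforms.insert(
        "add".to_string(),
        FieldTransform::Replace(Expression::Transform(inner)),
    );
    Expression::Transform(outer)
}

/// Checks the outer and inner structure instead of only the outer variant.
fn check_structure(expr: &Expression) -> bool {
    let Expression::Transform(outer) = expr else {
        return false;
    };
    let Some(FieldTransform::Replace(Expression::Transform(inner))) =
        outer.field_transforms.get("add")
    else {
        return false;
    };
    let stats_is_coalesce = matches!(
        inner.field_transforms.get("stats"),
        Some(FieldTransform::Replace(Expression::Coalesce(_)))
    );
    let parsed_is_dropped = matches!(
        inner.field_transforms.get("stats_parsed"),
        Some(FieldTransform::Drop)
    );
    stats_is_coalesce && parsed_is_dropped
}
```

The same nested `let ... else` and `matches!` pattern could be adapted to the real expression types, so the test fails if either the outer replacement or the inner field transforms change shape.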

