-
Notifications
You must be signed in to change notification settings - Fork 139
feat: add stats transform module for checkpoint stats population #1646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1646 +/- ##
==========================================
- Coverage 84.65% 84.60% -0.06%
==========================================
Files 123 124 +1
Lines 34109 34358 +249
Branches 34109 34358 +249
==========================================
+ Hits 28875 29067 +192
- Misses 3905 3960 +55
- Partials 1329 1331 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
af4261b to
0aefe2a
Compare
0aefe2a to
628c05b
Compare
628c05b to
66c1b43
Compare
55236f5 to
eab701e
Compare
nicklan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly looks good, just a few small things.
one bigger question, we're doing a lot of clone()ing of fields here. Are the passed schema generally owned such that we could massage them in place, or are they anyway shared schema that we'd have to clone in order to modify? If it's the former, maybe let's make a follow-up to add the needed ability to StructType to insert/remove/modify fields and use that to avoid quite so much copying.
| //! Transforms for populating stats_parsed and stats fields in checkpoint data. | ||
| //! | ||
| //! When writing checkpoints, statistics can be stored in two formats: | ||
| //! - `stats`: JSON string format (controlled by `writeStatsAsJson` table property) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably say it's the delta.writeStatsAsJson property right? ditto next line. let's also note the default values
fa1de9f to
4e66ca0
Compare
## 🥞 Stacked PR Use this [link](https://github.com/delta-io/delta-kernel-rs/pull/1645/files) to review incremental changes. - [**stack/null-propagation**](#1645) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1645/files)] - [stack/coalesce](#1648) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1648/files/37e755009566511bf7c2f00e014c1647e77e4533..d64042f7908844ef2d8a1c68312dc3ff936d60dc)] - [stack/checkpoint-transforms](#1646) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1646/files/d64042f7908844ef2d8a1c68312dc3ff936d60dc..4e66ca004f89b23431a96ac106a9c0d400718b10)] - [stack/write-stats](#1643) [[Files changed](https://github.com/delta-io/delta-kernel-rs/pull/1643/files/4e66ca004f89b23431a96ac106a9c0d400718b10..cd64f79fd3b40ebfa811cb333369cb17aa1a2a74)] --------- ## What changes are proposed in this pull request? Fixes a bug in nested transform expression evaluation where null rows in the source struct were losing their null bitmap, causing null structs to incorrectly appear as non-null structs with null fields. When evaluating nested transform expressions (transforms with an input_path that operate on a nested struct), the output StructArray was created with None for the null buffer: `let data = StructArray::try_new(output_fields.into(), output_cols, None)?;` This meant that if the source struct had null rows (e.g., an add action that is null in a checkpoint batch), the output would lose that null information. The struct would appear as non-null but with all-null fields, which is semantically different. ## How was this change tested? Existing transform tests pass. The stats transform integration tests (in a follow-up PR) exercise this code path.
9848c2d to
0f6c163
Compare
Copying is necessary, put up this issue: #1657 |
0f6c163 to
4b77bb4
Compare
| //! based on the table configuration. Statistics can be stored in two formats as fields on | ||
| //! the `Add` action: | ||
| //! - `stats`: JSON string format, controlled by `delta.checkpoint.writeStatsAsJson` (default: true) | ||
| //! - `stats_parsed`: Native struct format, controlled by `delta.checkpoint.writeStatsAsStruct` (default: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems it's default to false in protocol?
stats_parsed: The stats can be stored in their original format. This field needs to be written when statistics are available and the table property: delta.checkpoint.writeStatsAsStruct is set to true. When this property is set to false (which is the default), this field should be omitted from the checkpoint.
| pub(super) fn from_table_properties(properties: &TableProperties) -> Self { | ||
| Self { | ||
| write_stats_as_json: properties.checkpoint_write_stats_as_json.unwrap_or(true), | ||
| write_stats_as_struct: properties.checkpoint_write_stats_as_struct.unwrap_or(true), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto on the default as false
| // Verify we get a Transform expression | ||
| let Expression::Transform(_) = transform_expr.as_ref() else { | ||
| panic!("Expected Transform expression"); | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: This test only verifies the return type is Transform. Wonder if we want to test the inner content of the transform_expr as well? E.g.
- Outer transform replaces
"add"field - Inner transform has
field_transformsfor"stats"(with COALESCE expr) and"stats_parsed"(dropped)
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
Adds a new stats_transform module that provides the core logic for populating stats and stats_parsed fields when writing checkpoints. This module builds transform expressions based on table configuration to ensure statistics are properly converted between JSON and struct formats.
How was this change tested?
Next Pr