feat: add classic and uuid parquet checkpoint path generation #782

sebastiantia · 2025-03-31T01:29:30Z

What changes are proposed in this pull request?

This PR introduces two helper methods and a dependency update:

- new_uuid_parquet_checkpoint creates a new ParsedCheckpointPath<Url> for a UUID-named parquet checkpoint file at the specified version. The UUID naming scheme looks like n.checkpoint.u.parquet, where u is a UUID and n is the snapshot version that this checkpoint represents.
- new_classic_parquet_checkpoint creates a new ParsedCheckpointPath<Url> for a classic-named parquet checkpoint file at the specified version. The classic naming scheme looks like n.checkpoint.parquet, where n is the snapshot version that this checkpoint represents.
- Updates the uuid dependency to always include the v4 and fast-rng features:
  - This ensures that Uuid::new_v4() is always available.
  - The fast-rng feature improves performance when generating UUIDs.

For more information on the two checkpoint naming-schemes:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#uuid-named-checkpoint
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#classic-checkpoint
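
For illustration only, the two naming schemes can be sketched as plain string formatting. This is not the code in kernel/src/path.rs: the function names classic_checkpoint_name and uuid_checkpoint_name are hypothetical, the UUID is passed in as text so the sketch needs no external crate, and the 20-digit zero padding of the version is assumed from the Delta protocol's log file naming.

```rust
/// Builds a classic-named checkpoint file name: `n.checkpoint.parquet`.
/// Hypothetical helper; the version is zero-padded to 20 digits.
fn classic_checkpoint_name(version: u64) -> String {
    format!("{version:020}.checkpoint.parquet")
}

/// Builds a UUID-named checkpoint file name: `n.checkpoint.u.parquet`.
/// Hypothetical helper; `uuid` is supplied by the caller as text here,
/// whereas real code would generate it (e.g. with `uuid::Uuid::new_v4()`).
fn uuid_checkpoint_name(version: u64, uuid: &str) -> String {
    format!("{version:020}.checkpoint.{uuid}.parquet")
}
```

Both helpers only produce the file name; joining it onto the table's log directory URL is left out of the sketch.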

This PR is part of the ongoing effort to implement single-file checkpoint write support. For reference, see the write API proposal (#779).

How was this change tested?

- test_new_uuid_parquet_checkpoint: verifies UUID-named Parquet checkpoint creation with proper attributes.
- test_new_classic_parquet_checkpoint: verifies classic-named Parquet checkpoint creation with proper attributes.

codecov · 2025-03-31T01:32:41Z

Codecov Report

Attention: Patch coverage is 76.19048% with 15 lines in your changes missing coverage. Please review.

Project coverage is 84.65%. Comparing base (4ad2bc6) to head (c95e5c9).
Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
kernel/src/path.rs	76.19%	8 Missing and 7 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #782      +/-   ##
==========================================
- Coverage   84.67%   84.65%   -0.02%     
==========================================
  Files          83       83              
  Lines       19780    19836      +56     
  Branches    19780    19836      +56     
==========================================
+ Hits        16748    16792      +44     
- Misses       2213     2221       +8     
- Partials      819      823       +4

zachschuermann

looks good, just a comment on simplification

kernel/src/path.rs

zachschuermann

comment again on simplifying - sorry for taking you back in the other direction now!

kernel/src/path.rs

roeap

overall looking good, just raised one point around errors, but would just follow other folks' opinions on this :)

kernel/src/path.rs

roeap · 2025-04-03T09:17:19Z

kernel/src/path.rs

+        if !path.is_checkpoint() {
+            return Err(Error::internal_error(
+                "ParsedLogPath::new_classic_parquet_checkpoint created a non-checkpoint path",
+            ));
+        }


we do this for commits as well. However, since the user cannot really do anything about this (we are generating the file name and deciding what the log folder is), this feels like something we should catch at test time, i.e. are we not just validating that the logic we use to generate a path is compatible with how we check whether a path is a checkpoint?

In other words, whenever this raises, it points to a bug in our code?
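
To make the "catch this at test time" idea concrete, here is a minimal sketch of a round-trip test. It is not the PR's test code: classic_checkpoint_name and parse_classic_checkpoint are hypothetical stand-ins for the path generator and the is_checkpoint check, and the 20-digit zero padding is assumed from the Delta protocol.

```rust
/// Hypothetical generator: classic checkpoint file name for a version.
fn classic_checkpoint_name(version: u64) -> String {
    format!("{version:020}.checkpoint.parquet")
}

/// Hypothetical recognizer: returns the version if `name` follows the
/// classic checkpoint naming scheme, and `None` otherwise.
fn parse_classic_checkpoint(name: &str) -> Option<u64> {
    let digits = name.strip_suffix(".checkpoint.parquet")?;
    if digits.len() != 20 || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

#[cfg(test)]
mod tests {
    use super::*;

    /// If the generator and the recognizer ever disagree, this fails at
    /// test time instead of surfacing as a runtime internal error.
    #[test]
    fn generated_names_are_recognized_as_checkpoints() {
        for version in [0, 1, 10, u64::MAX] {
            let name = classic_checkpoint_name(version);
            assert_eq!(parse_classic_checkpoint(&name), Some(version));
        }
    }
}
```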

sebastiantia · 2025-04-03

You're absolutely right: if this check ever fails, it would indicate a bug in our path generation logic rather than something the user can control. Given that test coverage for this already exists, we likely don't need the runtime checks. Do you think there's additional value in keeping the checks? @zachschuermann

scovich

Since this changes our dependency setup, should we wait to merge until we actually have a use for the code? Or worth it to merge this separately?

kernel/src/path.rs

sebastiantia · 2025-04-03T16:52:48Z

> Since this changes our dependency setup, should we wait to merge until we actually have a use for the code? Or worth it to merge this separately?

I think this is reasonable. I have a few other PRs for checkpoint write support (tracked in #795) that need to merge before I can leverage this code (in #797). But I did plan to merge this as a separate PR for ease of review.

zachschuermann

LGTM after addressing a couple quick things

kernel/src/path.rs

zachschuermann · 2025-04-04T16:29:38Z

kernel/src/path.rs

+        if !path.is_checkpoint() {
+            return Err(Error::internal_error(
+                "ParsedLogPath::new_classic_parquet_checkpoint created a non-checkpoint path",
+            ));
+        }


fair - I wonder if we could do better in the future by having the types encode whether or not it is a checkpoint etc. (then it wouldn't be a runtime check here: if you have a checkpoint-specific constructor and a checkpoint type, then you are guaranteed to get back a checkpoint)

given that this is no different than what we had before (and it's an InternalError), how about we take this as a follow-up?

@sebastiantia can you make an issue?
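
A rough sketch of what "having the types encode it" could look like, for discussion only. CheckpointPath and CheckpointKind are hypothetical types that do not exist in this PR, the UUID is held as text so the sketch needs no external crate, and the 20-digit zero padding is assumed from the Delta protocol.

```rust
/// Naming scheme of a checkpoint file (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
enum CheckpointKind {
    Classic,
    /// UUID kept as text so the sketch stays dependency-free.
    Uuid(String),
}

/// Hypothetical type: any value of it is a checkpoint by construction,
/// so no `is_checkpoint()` runtime check is needed after building one.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CheckpointPath {
    version: u64,
    kind: CheckpointKind,
}

impl CheckpointPath {
    /// Infallible constructor for a classic-named checkpoint.
    fn classic(version: u64) -> Self {
        Self { version, kind: CheckpointKind::Classic }
    }

    /// Infallible constructor for a UUID-named checkpoint.
    fn uuid_named(version: u64, uuid: impl Into<String>) -> Self {
        Self { version, kind: CheckpointKind::Uuid(uuid.into()) }
    }

    /// File name derived from the typed fields rather than parsed back.
    fn file_name(&self) -> String {
        match &self.kind {
            CheckpointKind::Classic => format!("{:020}.checkpoint.parquet", self.version),
            CheckpointKind::Uuid(u) => format!("{:020}.checkpoint.{u}.parquet", self.version),
        }
    }
}
```

Because the constructors return Self rather than a Result, the "created a non-checkpoint path" error case cannot be expressed at all in this shape.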

kernel/src/path.rs

sebastiantia added 2 commits March 30, 2025 18:09

classic & uuid parquest checkpoint paths

1c0c5ae

doc fix

7c6660c

sebastiantia requested a review from zachschuermann March 31, 2025 01:29

github-actions bot assigned sebastiantia Mar 31, 2025

make uuid globalg

788719c

zachschuermann reviewed Apr 1, 2025

refactor

dbcdd90

sebastiantia mentioned this pull request Apr 2, 2025

single-file checkpoint write API progress tracker #795

Open

8 tasks

refactor

b552a90

sebastiantia requested a review from zachschuermann April 2, 2025 00:44

zachschuermann reviewed Apr 2, 2025

zachschuermann requested a review from roeap April 2, 2025 16:00

sebastiantia added 2 commits April 2, 2025 10:32

refactor

35bbbdc

dead code

0606527

sebastiantia requested review from hntd187 and zachschuermann April 2, 2025 17:32

sebastiantia and others added 2 commits April 2, 2025 10:35

Merge branch 'main' into create_checkpoint_paths

c4ba047

remove line

d41d1d3

sebastiantia requested a review from OussamaSaoudi April 2, 2025 18:33

roeap approved these changes Apr 3, 2025

scovich reviewed Apr 3, 2025

sebastiantia requested a review from scovich April 3, 2025 16:53

sebastiantia added 3 commits April 3, 2025 09:55

cut line

9527dd9

Merge branch 'main' into create_checkpoint_paths

576ae4c

include trailing /

9c0b18b

zachschuermann approved these changes Apr 4, 2025

sebastiantia and others added 2 commits April 7, 2025 09:48

nits

6978175

Merge branch 'main' into create_checkpoint_paths

bd141b9

sebastiantia added 2 commits April 7, 2025 12:15

Merge branch 'main' into create_checkpoint_paths

2368dfb

Merge branch 'main' into create_checkpoint_paths

c95e5c9

sebastiantia merged commit c84b079 into delta-io:main Apr 7, 2025
19 of 21 checks passed

sebastiantia deleted the create_checkpoint_paths branch April 7, 2025 20:54

This was referenced Apr 14, 2025

[tracking issue] checkpoint write support #499

Closed

[tracking issue] Single-file classic-named V1 & V2 checkpoint write support #736

Closed
