Commit 929ac08

feat: add CheckpointVisitor in new checkpoint mod (#738)
## What changes are proposed in this pull request?

### Key changes

Resolves #737.

This PR implements the `CheckpointVisitor`, which filters a stream of actions down to the actions to be included in a checkpoint file. It leverages the `FileActionDeduplicator` [[link to PR]](#769).

This PR introduces the `checkpoint` mod and implements the visitor in the new `checkpoint/log_replay` mod. Comprehensive module documentation is included in the new modules, giving an overview of the incoming code additions and their goals.

### Checkpoint Content

A **complete V1 checkpoint** encapsulates:

1. All FILE actions that make up the state of a version of a table:
   - Add actions (after action reconciliation)
   - Unexpired remove actions ([remove tombstones](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file))
2. All NON-FILE actions that make up the state of a version of a table:
   - Protocol action
   - Metadata action
   - Txn actions

A **single-file V2 checkpoint** is simply a superset of the actions included in the V1 checkpoint schema, with the addition of the `CheckpointMetadata` action (which must be generated on every write). Since single-file V2 checkpoints will also leverage this visitor, we have chosen the general name `CheckpointVisitor`.

Note:
- CDC, CommitInfo, Sidecar, and CheckpointMetadata actions are NOT part of the **V1** checkpoint schema.
- Sidecar and CheckpointMetadata actions are part of the **V2** checkpoint schema.

### The new `CheckpointVisitor`

This visitor selects the **FILE** actions for a V1 spec checkpoint via a selection vector:

1. Processes add/remove actions, deduplicating them on (path, deletion vector ID) pairs.
2. Optimization: only tracks already-seen file paths for **commit files**, since actions in checkpoint files are the last batches to be processed and do not conflict with other actions in checkpoint files.
3. Applies tombstone expiration logic by filtering out remove actions whose deletion timestamps are older than the minimum file retention timestamp.

This visitor also selects the **NON-FILE** actions for a V1 spec checkpoint via a selection vector:

1. Ensures exactly one protocol action is included (the newest one encountered).
2. Ensures exactly one metadata action is included (the newest one encountered).
3. Deduplicates transaction (txn) actions by app ID to include only the newest action for each app ID.

## How was this change tested?

- `test_checkpoint_visitor`: Tests basic functionality with both file and non-file actions, verifying correct counts and selection vector.
- `test_checkpoint_visitor_boundary_cases_for_tombstone_expiration`: Tests how tombstone expiration handles threshold boundary conditions.
- `test_checkpoint_visitor_conflicting_file_actions_in_log_batch`: Verifies duplicate path handling in log batches (keeping the first, skipping the second).
- `test_checkpoint_visitor_file_actions_in_checkpoint_batch`: Tests that duplicate file actions are included in checkpoint batches.
- `test_checkpoint_visitor_conflicts_with_deletion_vectors`: Tests file deduplication with deletion vectors to ensure uniqueness.
- `test_checkpoint_visitor_already_seen_non_file_actions`: Verifies that pre-populated actions are skipped correctly.
- `test_checkpoint_visitor_duplicate_non_file_actions`: Tests deduplication of non-file actions (protocol, metadata, transactions).
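
For readers who want a concrete picture of the selection rules listed under "The new `CheckpointVisitor`", here is a minimal, standalone sketch. It is not the code added in this PR: the `Action` enum, `VisitorState`, and `select_actions` are hypothetical names invented for illustration, and it works on plain Rust values rather than the kernel's columnar batches. It assumes batches arrive newest first, that an expired tombstone still marks its file key as seen, and that a tombstone whose timestamp equals the threshold is kept. The description does not spell out the last two points.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified action type (not the kernel's schema).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Action {
    Add { path: String, dv_id: Option<String> },
    Remove { path: String, dv_id: Option<String>, deletion_timestamp: i64 },
    Protocol,
    Metadata,
    Txn { app_id: String },
}

/// Hypothetical state carried across batches, which are assumed to be
/// processed from newest to oldest.
#[derive(Default)]
struct VisitorState {
    seen_files: HashSet<(String, Option<String>)>,
    seen_protocol: bool,
    seen_metadata: bool,
    seen_txns: HashSet<String>,
}

impl VisitorState {
    /// Returns true if this (path, deletion vector ID) key has not been seen.
    /// Keys are only recorded for commit (log) batches, mirroring the
    /// optimization described above for checkpoint batches.
    fn first_sighting(&mut self, path: &str, dv_id: &Option<String>, is_log_batch: bool) -> bool {
        let key = (path.to_string(), dv_id.clone());
        if self.seen_files.contains(&key) {
            return false;
        }
        if is_log_batch {
            self.seen_files.insert(key);
        }
        true
    }
}

/// Builds a selection vector for one batch: `true` means the action would be
/// written to the checkpoint.
fn select_actions(
    state: &mut VisitorState,
    batch: &[Action],
    is_log_batch: bool,
    min_file_retention_timestamp: i64,
) -> Vec<bool> {
    batch
        .iter()
        .map(|action| match action {
            Action::Add { path, dv_id } => state.first_sighting(path, dv_id, is_log_batch),
            Action::Remove { path, dv_id, deletion_timestamp } => {
                // Assumption: the key is recorded before the expiration check,
                // so an expired tombstone still shadows older adds.
                let first = state.first_sighting(path, dv_id, is_log_batch);
                // Assumption: a timestamp equal to the threshold is not expired.
                first && *deletion_timestamp >= min_file_retention_timestamp
            }
            // Only the newest protocol and metadata actions are selected.
            Action::Protocol => !std::mem::replace(&mut state.seen_protocol, true),
            Action::Metadata => !std::mem::replace(&mut state.seen_metadata, true),
            // `insert` returns true only the first time an app ID is seen.
            Action::Txn { app_id } => state.seen_txns.insert(app_id.clone()),
        })
        .collect()
}
```

A caller would create one `VisitorState::default()`, then call `select_actions` once per batch, passing `is_log_batch = true` for commit batches and `false` for checkpoint batches. Keeping the state outside the function is what lets the protocol, metadata, txn, and file-key deduplication span batches.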
1 parent 7e62d12 commit 929ac08

5 files changed: +606 -6 lines changed
