Commit 929ac08
## What changes are proposed in this pull request?
### Key changes
Resolves #737.
This PR implements the `CheckpointVisitor`, which filters a stream of
actions down to the actions that belong in a checkpoint file. It
leverages the `FileActionDeduplicator` [[link to PR]](#769).
This PR introduces the `checkpoint` mod and implements the visitor in
the new `checkpoint/log_replay` mod.
Both new modules include module-level documentation that gives an
overview of the incoming code additions and their goals.
### Checkpoint Content
A **complete V1 checkpoint** encapsulates:
1. All FILE actions that make up the state of a version of a table:
- Add actions (after action reconciliation)
- Unexpired remove actions ([remove
tombstones](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#add-file-and-remove-file))
2. All NON-FILE actions that make up the state of a version of a table:
- Protocol action
- Metadata action
- Txn actions
A **single-file V2 checkpoint** is simply a superset of the actions
included in the V1 checkpoint schema, with the addition of the
`CheckpointMetadata` action (which must be generated on every write).
Since single-file V2 checkpoints will also leverage this visitor, we
have chosen to give it the general name `CheckpointVisitor`.
Note:
- CDC, CommitInfo, Sidecar, and CheckpointMetadata actions are NOT part
of the **V1** checkpoint schema.
- Sidecar and CheckpointMetadata actions are part of the **V2**
checkpoint schema.
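
The schema membership described above can be summarized in a small sketch. The `ActionKind` and `CheckpointSchema` enums and the `in_checkpoint_schema` function are hypothetical names made up for illustration; they are not part of the kernel API. Treating CDC and CommitInfo as excluded from V2 as well is an assumption based on V2 being described as V1 plus checkpoint-specific actions.

```rust
/// Hypothetical enumeration of Delta log action kinds (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ActionKind {
    Add,
    Remove,
    Protocol,
    Metadata,
    Txn,
    Cdc,
    CommitInfo,
    Sidecar,
    CheckpointMetadata,
}

/// Hypothetical tag for the checkpoint schema being targeted.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckpointSchema {
    V1,
    V2,
}

/// Returns true if actions of `kind` are part of the given checkpoint schema.
pub fn in_checkpoint_schema(kind: ActionKind, schema: CheckpointSchema) -> bool {
    use ActionKind::*;
    match kind {
        // File and non-file actions that make up the table state.
        Add | Remove | Protocol | Metadata | Txn => true,
        // Only part of the V2 checkpoint schema.
        Sidecar | CheckpointMetadata => schema == CheckpointSchema::V2,
        // Assumption: never written to a checkpoint of either version.
        Cdc | CommitInfo => false,
    }
}
```

For example, `in_checkpoint_schema(ActionKind::CheckpointMetadata, CheckpointSchema::V1)` evaluates to `false`, while the same call with `CheckpointSchema::V2` evaluates to `true`.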
### The new `CheckpointVisitor`
This visitor selects the **FILE** actions for a V1 spec checkpoint via a
selection vector:
1. Processes add/remove actions with proper deduplication based on path
and deletion vector ID pairs
2. Optimization: Only tracks already seen file paths in **commit
files**, as actions in checkpoint files are the last batches to be
processed, and do not conflict with other actions in checkpoint files.
3. Applies tombstone expiration logic by filtering out remove actions
with deletion timestamps older than the minimum file retention timestamp
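
The three file-action steps above can be sketched as follows. This is a simplified illustration, not the code in this PR: `FileKey`, `FileAction`, and `FileActionFilter` are hypothetical types, actions are assumed to arrive newest-first, and the real visitor operates on columnar batches rather than on an enum. Keeping a remove whose timestamp exactly equals the threshold is a choice made for this sketch only.

```rust
use std::collections::HashSet;

/// Hypothetical deduplication key: file path plus optional deletion vector ID.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FileKey {
    pub path: String,
    pub dv_id: Option<String>,
}

/// Hypothetical, simplified file action.
#[derive(Debug, Clone)]
pub enum FileAction {
    Add { key: FileKey },
    Remove { key: FileKey, deletion_timestamp: i64 },
}

pub struct FileActionFilter {
    seen: HashSet<FileKey>,
    min_retention_timestamp: i64,
}

impl FileActionFilter {
    pub fn new(min_retention_timestamp: i64) -> Self {
        Self { seen: HashSet::new(), min_retention_timestamp }
    }

    /// Decides whether one action belongs in the checkpoint.
    pub fn select(&mut self, action: &FileAction, is_log_batch: bool) -> bool {
        let key = match action {
            FileAction::Add { key } | FileAction::Remove { key, .. } => key,
        };
        // A newer action for the same (path, DV ID) pair wins.
        if self.seen.contains(key) {
            return false;
        }
        // Only commit (log) batches are tracked; checkpoint batches come last.
        if is_log_batch {
            self.seen.insert(key.clone());
        }
        match action {
            FileAction::Add { .. } => true,
            // Drop tombstones older than the retention threshold.
            FileAction::Remove { deletion_timestamp, .. } => {
                *deletion_timestamp >= self.min_retention_timestamp
            }
        }
    }

    /// Builds a selection vector with one entry per action in the batch.
    pub fn select_batch(&mut self, batch: &[FileAction], is_log_batch: bool) -> Vec<bool> {
        batch.iter().map(|a| self.select(a, is_log_batch)).collect()
    }
}
```

Note that an expired remove is still recorded as seen before it is dropped, so it continues to mask any older add for the same file.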
This visitor also selects the **NON-FILE** actions for a V1 spec
checkpoint via a selection vector:
1. Ensures exactly one protocol action is included (the newest one
encountered)
2. Ensures exactly one metadata action is included (the newest one
encountered)
3. Deduplicates transaction (txn) actions by app ID to include only the
newest action for each app ID
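
The non-file rules above amount to "first one seen wins" when actions are replayed newest-first. A minimal sketch under that assumption follows; `NonFileAction` and `NonFileActionFilter` are hypothetical names for illustration and do not mirror the visitor's actual fields.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified non-file action.
#[derive(Debug, Clone)]
pub enum NonFileAction {
    Protocol,
    Metadata,
    Txn { app_id: String },
}

/// Tracks which non-file actions have already been selected.
#[derive(Debug, Default)]
pub struct NonFileActionFilter {
    seen_protocol: bool,
    seen_metadata: bool,
    seen_txns: HashSet<String>,
}

impl NonFileActionFilter {
    /// Returns true only for the first (newest) protocol, the first (newest)
    /// metadata, and the first (newest) txn action per app ID.
    pub fn select(&mut self, action: &NonFileAction) -> bool {
        match action {
            // `replace` returns the previous flag, so this is true exactly once.
            NonFileAction::Protocol => !std::mem::replace(&mut self.seen_protocol, true),
            NonFileAction::Metadata => !std::mem::replace(&mut self.seen_metadata, true),
            // `insert` returns false if the app ID was already present.
            NonFileAction::Txn { app_id } => self.seen_txns.insert(app_id.clone()),
        }
    }
}
```

Pre-populating the flags or the app ID set before replay would cause the corresponding actions to be skipped, which is the kind of situation `test_checkpoint_visitor_already_seen_non_file_actions` describes.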
## How was this change tested?
- `test_checkpoint_visitor`: Tests basic functionality with both file and
non-file actions, verifying correct counts and selection vector.
- `test_checkpoint_visitor_boundary_cases_for_tombstone_expiration`:
Tests how tombstone expiration handles threshold boundary conditions.
- `test_checkpoint_visitor_conflicting_file_actions_in_log_batch`:
Verifies duplicate path handling in log batches (keeping the first,
skipping the second).
- `test_checkpoint_visitor_file_actions_in_checkpoint_batch`: Tests that
duplicate file actions are included in checkpoint batches.
- `test_checkpoint_visitor_conflicts_with_deletion_vectors`: Tests file
deduplication with deletion vectors to ensure uniqueness.
- `test_checkpoint_visitor_already_seen_non_file_actions`: Verifies that
pre-populated actions are skipped correctly.
- `test_checkpoint_visitor_duplicate_non_file_actions`: Tests
deduplication of non-file actions (protocol, metadata, transactions).
1 parent 7e62d12, commit 929ac08
5 files changed (+606, -6 lines)
File tree:
- kernel/src
- checkpoint
- scan