-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add V1CheckpointVisitor
that wraps the new FileActionDeduplicator
#738
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #738 +/- ##
==========================================
+ Coverage 84.34% 84.82% +0.47%
==========================================
Files 81 83 +2
Lines 19233 20133 +900
Branches 19233 20133 +900
==========================================
+ Hits 16223 17078 +855
- Misses 2207 2220 +13
- Partials 803 835 +32 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
0221955
to
19733cd
Compare
/// access stale snapshots. A remove action remains as a tombstone in a checkpoint file until | ||
/// it expires, which happens when the current time exceeds the removal timestamp plus the | ||
/// expiration threshold. | ||
fn is_expired_tombstone<'a>(&self, i: usize, getter: &'a dyn GetData<'a>) -> DeltaResult<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, remove.deletion_timestamp is set to 0 in Java-kernel. So if the field is not present, the remove action is excluded from the checkpoint file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add inline
CheckpointFileActionsVisitor
and CheckpointNonFileActionsVisitor
CheckpointVisitor
kernel/src/actions/visitors.rs
Outdated
/// This filtered set of actions represents the minimal set needed to reconstruct | ||
/// the latest valid state of the table. | ||
#[cfg_attr(feature = "developer-visibility", visibility::make(pub))] | ||
pub(crate) struct CheckpointVisitor<'seen> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the AddRemoveDedupVisitor
, I think this should be moved to a new /checkpoint/log_replay
mod
CheckpointVisitor
CheckpointVisitor
that wraps the new FileActionDeduplicator
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
synced offline - decided to split out FileActionDeduplicator first then land the checkpoint visitor
8cf054b
to
51fbc68
Compare
9dd207b
to
7c3d976
Compare
…to embeddable `FileActionsDeduplicator` (#769) <!-- Thanks for sending a pull request! Here are some tips for you: 1. If this is your first time, please read our contributor guidelines: https://github.com/delta-incubator/delta-kernel-rs/blob/main/CONTRIBUTING.md 2. Run `cargo t --all-features --all-targets` to get started testing, and run `cargo fmt`. 3. Ensure you have added or run the appropriate tests for your PR. 4. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP] Your PR title ...'. 5. Be sure to keep the PR description updated to reflect all changes. --> <!-- PR title formatting: This project uses conventional commits: https://www.conventionalcommits.org/ Each PR corresponds to a commit on the `main` branch, with the title of the PR (typically) being used for the commit message on main. In order to ensure proper formatting in the CHANGELOG please ensure your PR title adheres to the conventional commit specification. Examples: - new feature PR: "feat: new API for snapshot.update()" - bugfix PR: "fix: correctly apply DV in read-table example" --> ## What changes are proposed in this pull request? <!-- Please clarify what changes you are proposing and why the changes are needed. The purpose of this section is to outline the changes, why they are needed, and how this PR fixes the issue. If the reason for the change is already explained clearly in an issue, then it does not need to be restated here. 1. If you propose a new API or feature, clarify the use case for a new API or feature. 2. If you fix a bug, you can clarify why it is a bug. --> **No behavioral changes were introduced, this is purely a refactoring effort** This PR extracts the core deduplication logic from the `AddRemoveDedupVisitor` in order to be shared with the incoming `CheckpointVisitor`. For a bigger picture view on how this refactor is helpful, please take a look at the following PR which implements the `CheckpointVisitor` with an embedded `FileActionsDeduplicator` that will rebase this PR once merged. [[link to PR]](#738). This `FileActionsDeduplicator` lives in the new top-level `log_replay` mod as it will be leveraged in the nested `scan/log_replay` mod and the incoming `checkpoints/log_replay` mod. There are also additional traits & structs that the two `log_replay` implementations will share via this new top-level mod. For an even wider view of the implementation of the `checkpoints` mod and the component re-use, please have a look at the following PR. [[link to PR]](#744) ## Summary of refactor 1. New `log_replay` mod 2. Moved `FileActionKey` definition from `scan/log_replay` to the new `log_replay` mod 3. New `FileActionDeduplicator` in the new `log_replay` mod - Includes the `check_and_record_seen` method which was simply **moved** from the `AddRemoveDedupVisitor` - Includes the `extract_file_action` method and `extract_dv_unique_id` private method which may be new concepts, but includes functionality which are both pieces of functionality pulled from the `AddRemoveDedupVisitor` to be shared with the incoming `CheckpointVisitor` <!-- Uncomment this section if there are any changes affecting public APIs: ### This PR affects the following public APIs If there are breaking changes, please ensure the `breaking-changes` label gets added by CI, and describe why the changes are needed. Note that _new_ public APIs are not considered breaking. --> ## How was this change tested? <!-- Please make sure to add test cases that check the changes thoroughly including negative and positive cases if possible. If it was tested in a way different from regular unit tests, please clarify how you tested, ideally via a reproducible test documented in the PR description. --> All existing tests pass ✅
CheckpointVisitor
that wraps the new FileActionDeduplicator
V1CheckpointVisitor
that wraps the new FileActionDeduplicator
What changes are proposed in this pull request?
This PR implements the visitor necessary for filtering a stream of actions into a stream of actions to be included in a checkpoint file. resolves #737. This leverages the
FileActionDeduplicator
[link to PR], so please only review these commits.A checkpoint's contents:
A complete V1 checkpoint encapsulates:
Note:
The V1CheckpointVisitor:
This visitor selects the FILE actions for a V1 spec checkpoint via a selection vector:
This visitor also selects the NON-FILE actions for a V1 spec checkpoint via a selection vector:
How was this change tested?
test_checkpoint_visitor
- Tests basic functionality with both file and non-file actions, verifying correct counts and selection vector.test_v1_checkpoint_visitor_boundary_cases_for_tombstone_expiration
- Tests how tombstone expiration handles threshold boundary conditions.test_v1_checkpoint_visitor_conflicting_file_actions_in_log_batch
- Verifies duplicate path handling in log batches (keeping first, skipping second).test_v1_checkpoint_visitor_file_actions_in_checkpoint_batch
- Tests that duplicate file actions are included in checkpoint batches.test_v1_checkpoint_visitor_conflicts_with_deletion_vectors
- Tests file deduplication with deletion vectors to ensure uniqueness.test_v1_checkpoint_visitor_already_seen_non_file_actions
- Verifies that pre-populated actions are skipped correctly.test_v1_checkpoint_visitor_duplicate_non_file_actions
- Tests deduplication of non-file actions (protocol, metadata, transactions).