Skip to content

feat: add transaction.with_transaction_id() API #824

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

zachschuermann
Copy link
Collaborator

@zachschuermann zachschuermann commented Apr 11, 2025

What changes are proposed in this pull request?

(part of idempotent writes support) this PR adds transaction.with_transaction_id(app_id: String, version: i64) API for commiting txn actions to the log. The only invariant imposed by kernel (as indicated by the spec) is that each app_id in a commit is unique.

The semantics of the application-specific version are left up to the external system. Delta only ensures that the latest version for a given appId is available in the table snapshot. The Delta transaction protocol does not, for example, assume monotonicity of the version and it would be valid for the version to decrease, possibly representing a "rollback" of an earlier transaction.

Note that in order to make this useful to engines we need to expose SetTransactionScanner appropriately (in follow-up)

This PR affects the following public APIs

Add Transaction::with_set_transaction(app_id, version)

How was this change tested?

TODO

resolves #845

Copy link

codecov bot commented Apr 11, 2025

Codecov Report

Attention: Patch coverage is 98.36066% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.11%. Comparing base (aebea47) to head (13dbdbc).

Files with missing lines Patch % Lines
kernel/src/actions/set_transaction.rs 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #824      +/-   ##
==========================================
+ Coverage   85.07%   85.11%   +0.04%     
==========================================
  Files          84       84              
  Lines       20802    20850      +48     
  Branches    20802    20850      +48     
==========================================
+ Hits        17697    17747      +50     
+ Misses       2227     2226       -1     
+ Partials      878      877       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added the breaking-change Change that require a major version bump label Apr 11, 2025
@@ -386,7 +386,7 @@ trait EvaluationHandlerExtension: EvaluationHandler {
}

// Auto-implement the extension trait for all EvaluationHandlers
impl<T: EvaluationHandler> EvaluationHandlerExtension for T {}
impl<T: EvaluationHandler + ?Sized> EvaluationHandlerExtension for T {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why needed, out of curiosity?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah since we use this on dyn EvaluationHandler it's !Sized - so have to relax that here

Comment on lines 108 to 116
// step 0: if there are SetTransaction(app_id, version) actions being committed, ensure
// that for every app_id, this commit's version is greater than any previous version. if
// so, we chain them to our actions iterator.
//
// TODO: we are currently doing set_transaction validation (another log replay) as a
// separate step but we could be smarter and merge this with the snapshot log replay above
// to produce a special snapshot with SetTransactions (but how do we know which ones to
// care about ahead of time?)
match self.validate_set_transactions(engine)? {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Delta spec specifically does not require monotonically increasing versions:

The semantics of the application-specific version are left up to the external system. Delta only ensures that the latest version for a given appId is available in the table snapshot. The Delta transaction protocol does not, for example, assume monotonicity of the version and it would be valid for the version to decrease, possibly representing a "rollback" of an earlier transaction.

We just need to block transactions with duplicate app ids.

Also, two overlapping transactions logically conflict if they mention the same app id. That won't matter until we add retries for resolving physical commit conflicts tho.

let actions = chain(iter::once(commit_info), adds);
let add_actions = generate_adds(engine, self.write_metadata.iter().map(|a| a.as_ref()));

// TODO: should we chain before/after file actions?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't matter which one we do, because the Delta spec doesn't impose any ordering constraints other than commit info must be first. So even if we put the app ids first, we'd still have to read the whole commit to prove no other app ids came later.

Comment on lines 59 to 63
// NB: we could use HashMap<String, SetTransaction> for set_transactions but since the expected
// number is small, we will save an allocation and (likely) have better memory locality just
// with a Vec. (following rule-of-thumb that 10-100 linear search will do same/better than hash
// lookup)
set_transactions: Vec<SetTransaction>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that the expected number of entries is small, the quadratic cost of duplicate detection could be catastrophic in the uncommon case where somebody did try to insert a large number of app ids.

That said, neither hashmap nor hashset seems to be a good fit -- hashmap would require either duplicating the appid or splitting SetTransaction into "key" (appid) and "payload" (everything else) pairs; hashset would require us to impl Borrow<&str> for SetTransaction, which in turn would require SetTransaction to have the same Eq, Ord, and Hash as &str. Plus, HashSet::insert drops the to-be-inserted value without returning the existing one, which would make error messaging unnecessarily difficult.

Perhaps, following the builder pattern, it's ok to make with_set_transaction infallible, and defer to the commit method for a one-time duplicate check? Then we pay O(n) validation cost by building up a HashSet<&str>:

        let mut app_ids = HashSet::new();
        if let Some(dup) = self.set_transactions.iter().find(|t| !app_ids.insert(&t.app_id)) {
            return Err(Error::generic(format!(
                "app_id {} already exists in transaction",
                dup.app_id
            )));
        }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep sounds good - i had originally built this way but attempted to make it better by checking "along the way" but now reverted back to this :)

@zachschuermann zachschuermann changed the title [WIP] feat: idempotent write support feat: add transaction.with_transaction_id() API Apr 14, 2025
@zachschuermann zachschuermann removed the breaking-change Change that require a major version bump label Apr 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support for transaction.with_transaction_id()
2 participants