-
Couldn't load subscription status.
- Fork 579
core/mvcc: Lock-free row version chain #3659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
4ac7ec7 to
c1953e5
Compare
|
Hey @ddwalias! Thanks, this gives a nice throughput improvement for higher thread count. I agree that lock-free is good step forward. The commit history is very messy and for a critical change like this, we need it to be extra clean. Can you for starters do a "git rebase -i |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a lock-free row version chain using the scc::LinkedList library, replacing the previous Vec-based approach with mutex protection. The implementation aims to improve concurrency by eliminating locks while maintaining the same MVCC functionality.
Key changes:
- Replaced
Vec<RowVersion>with lock-freeRowVersionChainfor storing row versions - Introduced
RowVersionNodeas a wrapper for versions in the linked list structure - Updated all database operations to work with the new lock-free iteration patterns
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| core/mvcc/database/mod.rs | Main implementation adding RowVersionNode, RowVersionChain, and updating MvStore to use lock-free operations |
| core/mvcc/database/checkpoint_state_machine.rs | Updated checkpoint logic to iterate over lock-free version chain |
| core/Cargo.toml | Added scc dependency for lock-free data structures |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
core/mvcc/database/mod.rs
Outdated
| /// A row version. | ||
| /// TODO: we can optimize this by using bitpacking for the begin and end fields. | ||
| #[derive(Clone, Debug, PartialEq)] | ||
| #[derive(Clone, Debug)] |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing PartialEq from RowVersion may break existing code that relies on equality comparisons. Consider adding a custom PartialEq implementation if needed for compatibility.
| #[derive(Clone, Debug)] | |
| #[derive(Clone, Debug, PartialEq)] |
| // This function is kinda unreliable when insert the version that has the same beginning, | ||
| // However this function is just to be compatible with how we currently handle inserting | ||
| // to row version chain and will be removed in the future, so whatever :) |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment acknowledges unreliable behavior but dismisses it casually. Consider either fixing the reliability issue or documenting the specific scenarios where it fails and their impact.
| // This function is kinda unreliable when insert the version that has the same beginning, | |
| // However this function is just to be compatible with how we currently handle inserting | |
| // to row version chain and will be removed in the future, so whatever :) | |
| // WARNING: This function does not reliably handle inserting a RowVersion whose 'begin' | |
| // timestamp is equal to that of an existing version in the chain. In such cases, the | |
| // ordering of versions with identical 'begin' timestamps may not be preserved, which can | |
| // lead to inconsistent iteration or retrieval results. This function exists only for | |
| // compatibility with the current row version chain insertion logic and is intended to be | |
| // removed in the future. Use with caution if duplicate 'begin' timestamps are possible. |
| if let Some(row_versions) = mvcc_store.rows.get(id) { | ||
| let mut row_versions = row_versions.value().write(); | ||
| for row_version in row_versions.iter_mut() { | ||
| row_versions.value().for_each_node(|node| { |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsafe code without proper documentation of safety invariants. Document why this unsafe access is sound and what guarantees ensure memory safety.
| row_versions.value().for_each_node(|node| { | |
| row_versions.value().for_each_node(|node| { | |
| // SAFETY: It is safe to call `node.row_version_mut()` here because: | |
| // - We have exclusive access to the `row_versions` structure for the duration of this commit. | |
| // - The closure is called sequentially for each node, and no other references to the same node exist. | |
| // - The borrow checker cannot verify this due to the data structure's internal mutability, but | |
| // the commit protocol ensures no aliasing or concurrent mutation occurs. | |
| // - Therefore, obtaining a mutable reference to the row version is sound and does not violate Rust's aliasing rules. |
| if is_write_write_conflict(&self.txs, tx, version) { | ||
| return Some(Err(LimboError::WriteWriteConflict)); | ||
| } | ||
| } |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another unsafe block without safety documentation. Add comments explaining the safety guarantees that make this access valid.
| } | |
| } | |
| // SAFETY: We have exclusive access to this node within the closure, | |
| // and no other references to the row version exist at this point. | |
| // The borrow checker cannot verify this due to the data structure, | |
| // but the logic of the surrounding code ensures that this is the only | |
| // mutable access to the row version for this row in this transaction. |
| // Hekaton uses oldest-to-newest order for row versions, so we reverse iterate to find the newest one | ||
| // this transaction changed. | ||
| for row_version in row_versions.iter_mut().rev() { | ||
| row_versions.value().for_each_node(|node| { |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsafe mutable access without documented safety invariants. Consider adding safety comments or exploring safer alternatives.
| row_versions.value().for_each_node(|node| { | |
| row_versions.value().for_each_node(|node| { | |
| // SAFETY: It is safe to call `node.row_version_mut()` here because: | |
| // - This function is only called during recovery, when no other references to the row versions exist. | |
| // - The `for_each_node` method guarantees exclusive access to each node during iteration. | |
| // - No other mutable or immutable references to the same row version are created in this context. | |
| // If these invariants are violated, this code could cause undefined behavior. |
| if let Some(row_versions) = self.rows.get(rowid) { | ||
| let mut row_versions = row_versions.value().write(); | ||
| for rv in row_versions.iter_mut() { | ||
| row_versions.value().for_each_node(|node| { |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsafe code block lacks safety documentation. Add comments explaining why this unsafe access is correct and what prevents data races.
| row_versions.value().for_each_node(|node| { | |
| row_versions.value().for_each_node(|node| { | |
| // SAFETY: It is safe to call `node.row_version_mut()` here because: | |
| // - We have exclusive access to the transaction's write set during rollback. | |
| // - The transaction is in the process of being aborted, and no other thread can access or mutate these nodes concurrently. | |
| // - The data structures involved (e.g., `self.rows`, `tx.write_set`) are only mutated by the owning transaction, and access is synchronized at a higher level. |
Hi, can we do a squash merge for this? If not, I can rebase after work. |
|
I just realized that mutating row version while iterate over row version chain might not be safe. I will mark this as draft until I can either confirm that it's safe or find an alternative way to do this |
d44a790 to
1c3ec25
Compare
025914e to
c4c68da
Compare
| if ts > self.checkpointed_txid_max_old { | ||
| version_to_checkpoint = Some(version); | ||
| if version_to_checkpoint.is_none() { | ||
| version_to_checkpoint = Some(version.clone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell this behavior changed - previously we were iterating from oldest to newest and replacing version_to_checkpoint whenever a newer version was encountered. Now we're only setting it once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, the previous row version chain Vec<RowVersion> have the records in the ascending order, whereas the new RowVersionChain linked list you have the records in the descending order. So doing this replicate the previous logic.
However, the ordering shouldn't matter, I plan to make RowVersionChain prepend only (will remove insert_sorted) so I will refactor this entirely so that we only checkpoint the version that is visible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal of this PR is only changing the data structure, while keeping the original high level design as much as possible, so that's why I don't touch the ordering.
| } | ||
| } | ||
|
|
||
| fn sentinel() -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you document these data structures so that it's clear what sentinel is for example? A small documentation on each data structure would be great + a separate method documentation for all non-trivial methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can do it
This PR implement a lock-free row version chain using scc::LinkedList which is based on this paper
Although this is not wait-free, it will help us steer in to the right direction to implement other feature that can be based on this (based on my understanding, this will allow us to GC easier than say Treiber's stack).
If you want to read about an approach to implement a wait-free linked list, there's a paper for it, but I'm not sure about the feasibility of it with our use case.
Reference #3499