Optimized RecordCursor, Remove read_varint_fast #4363
Open
fereidani wants to merge 12 commits into tursodatabase:main from fereidani:main
+97
−132
Conversation
Benchmarking showed that read_varint is just as fast as read_varint_fast for small varints. The fast path in read_varint_fast becomes slower once the varint exceeds 2 bytes, because it falls back and restarts the calculation. Previous attempts to optimize it further using bit-manipulation tricks did not yield improvements, so for now the standard read_varint implementation is the best option. If a superior algorithm is identified in the future, it should replace read_varint directly.
Please review @jussisaurio
PThorpe92 reviewed Dec 27, 2025
Description
This change improves op_column and ensure_parsed_upto, two performance-intensive functions, and removes multiple memory allocations.
The new implementation halves the memory allocations in these functions, simplifies the code, and removes many redundant checks.
Faster RecordCursor implementation
I found that the old RecordCursor grows multiple Vecs too many times.
The old algorithm used two separate vectors, one of which was oversized to hold the first offset. That first offset is already implied by
header_size, so we don't need to store it separately. The new algorithm merges serial_types and offsets into a single Vec and relies on
header_size instead of reserving an extra entry in the vector. This halves the number of grows needed to resize the vector, which improves performance, especially for longer, more complex queries with a large number of records.
Faster read_varint_fast
I improved the read_varint_fast implementation by reading a single u32 from memory and then operating on it in registers; the loop inside read_varint_fast is unrolled by the compiler. The new read_varint_fast is faster than the old one because, when the fast path fails, it continues from index 4 instead of restarting from index 0 as before. The over-reads the new algorithm performs are harmless, since it discards them once it finds the correct value. However, after many benchmarks I found that LLVM generates better code for the plain read_varint than anything I wrote or that existed before: the original read_varint_fast and read_varint showed no difference for short inputs, and read_varint_fast was actually slower for varints longer than 2 bytes.
I argue that read_varint_fast should not be in the library; if we find a better algorithm, we should improve read_varint directly.
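For context, a standard SQLite-style varint decoder of the kind read_varint implements can be sketched as below. The function name matches the PR's identifier, but the body is a generic reimplementation written for illustration, not the project's exact code.

```rust
// Minimal sketch of a SQLite-style varint decoder: up to 9 bytes, the
// first 8 bytes each carry 7 payload bits with a high continuation bit,
// and a 9th byte (if present) carries all 8 bits. Returns the decoded
// value and the number of bytes consumed, or None on truncated input.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for i in 0..8 {
        let byte = *buf.get(i)?;
        value = (value << 7) | (byte & 0x7f) as u64;
        if byte & 0x80 == 0 {
            // Continuation bit clear: this was the last byte.
            return Some((value, i + 1));
        }
    }
    // 9th byte contributes all 8 bits.
    let byte = *buf.get(8)?;
    value = (value << 8) | byte as u64;
    Some((value, 9))
}

fn main() {
    // Single-byte varint: values 0..=127 encode as themselves.
    assert_eq!(read_varint(&[0x2a]), Some((42, 1)));
    // Two-byte varint: 0x81 0x00 decodes to 128.
    assert_eq!(read_varint(&[0x81, 0x00]), Some((128, 2)));
    // Maximum-length varint consumes 9 bytes.
    assert_eq!(read_varint(&[0xff; 9]).map(|(_, n)| n), Some(9));
    println!("ok");
}
```

The simple byte-at-a-time loop above is the kind of code LLVM optimizes well, which is consistent with the benchmark result that the plain read_varint matched or beat the hand-tuned fast path.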
Improvements
My benchmarks show about a 1-6% gain:
Table for WITHOUT ANALYZE TPC-H:
HTML Chart:
comparison_chart_WITHOUT_ANALYZE.html
Motivation and context
Improving this great project!
Description of AI Usage
No direct AI usage; slight use for comments and final code review.