@fereidani fereidani commented Dec 26, 2025

Description

This change improves op_column and ensure_parsed_upto, two performance-intensive functions, and removes multiple memory allocations.

The new implementation halves the memory allocations in these functions, simplifies the code, and removes many redundant checks.

(Screenshots: 2025-12-26 01-12-46, 2025-12-26 02-00-36)

Faster RecordCursor implementation

I found that the older RecordCursor grows multiple Vecs too many times.

The older algorithm used two separate vectors, one of which was oversized to hold the first offset. The first offset is already captured by head_size, so we don't need to keep it separately.

The new algorithm merges serial_types and offsets into a single Vec and relies on header_size instead of reserving an extra entry in the vector.

This halves the number of grow operations needed to resize the vector, which improves performance especially for longer, more complex queries over large numbers of records. A sketch of the merged layout is shown below.
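
A minimal sketch of the merged layout, assuming the cursor tracks SQLite serial types and the end offsets of parsed columns within the record payload. The type and field names here (RecordCursorSketch, columns) are illustrative, not the PR's actual definitions:

```rust
/// Sketch only: one vector of (serial_type, end_offset) pairs instead of
/// two parallel Vecs, with the first column's start derived from header_size.
struct RecordCursorSketch {
    /// One entry per parsed column: (serial type, offset where the column ends).
    /// A single push per column means the backing allocation grows half as often.
    columns: Vec<(u64, usize)>,
    /// Size of the record header; replaces the extra leading entry the old
    /// offsets vector reserved for the first column's start.
    header_size: usize,
}

impl RecordCursorSketch {
    fn new(header_size: usize) -> Self {
        Self { columns: Vec::new(), header_size }
    }

    /// Start offset of column `i`: the header size for the first column,
    /// otherwise the end offset recorded for the previous column.
    fn column_start(&self, i: usize) -> usize {
        if i == 0 {
            self.header_size
        } else {
            self.columns[i - 1].1
        }
    }
}
```
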

Faster read_varint_fast

I improved the read_varint_fast implementation by reading a single u32 from memory and then operating on it in registers; the for loop inside read_varint_fast gets unrolled by the compiler.

The new read_varint_fast is faster than the old one because, when the fast path fails, it continues from index 4 instead of restarting from index 0 as before.

Don't worry about the over-reads the new algorithm performs; it discards the extra bytes once it finds the correct value.
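
As a rough illustration of this style of fast path, here is a minimal sketch assuming SQLite-style varints (7 data bits per byte, high bit set means another byte follows). The function decode_varint_prefix and its exact shape are hypothetical, not the PR's actual read_varint_fast:

```rust
/// Sketch: decode a varint if it fits in the first 4 bytes of `buf`.
/// Returns (value, bytes consumed), or None so a caller can fall back to a
/// slow path that continues from index 4 instead of restarting from 0.
fn decode_varint_prefix(buf: &[u8]) -> Option<(u64, usize)> {
    // Load up to 4 bytes into one u32 so byte extraction below happens on a
    // register value instead of repeated memory loads.
    let mut chunk = [0u8; 4];
    let n = buf.len().min(4);
    chunk[..n].copy_from_slice(&buf[..n]);
    let word = u32::from_be_bytes(chunk);

    let mut value: u64 = 0;
    // Fixed, small bound: a good candidate for compiler unrolling.
    for i in 0..n {
        let byte = (word >> (24 - 8 * i)) as u8;
        value = (value << 7) | u64::from(byte & 0x7f);
        if byte & 0x80 == 0 {
            // Any bytes loaded past this point are simply discarded.
            return Some((value, i + 1));
        }
    }
    None
}
```
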

I ran many benchmarks and found that LLVM generates better code for read_varint than anything I wrote or that existed before. In fact, the original read_varint_fast and read_varint showed no difference for shorter inputs, and read_varint_fast was actually slower for varints longer than 2 bytes.

I argue that read_varint_fast should not be in the library; if we find a better algorithm, we should improve read_varint directly.

Improvements

My benchmarks show roughly a 1-6% gain:

TPC-H results (WITHOUT ANALYZE):

| Query | Limbo #11b1494 (s) | Limbo #4ae8887 (s) | SQLite (s) | Limbo Diff | SQLite Diff | PR GAIN% |
|------:|-------------------:|-------------------:|-----------:|-----------:|------------:|---------:|
| 1 | 6.5235 | 6.9162 | 2.7526 | 0.3928 | 0.0083 | 5.68% |
| 2 | 0.9891 | 0.9789 | 0.2763 | 0.0103 | 0.0594 | -1.05% |
| 3 | 3.1394 | 3.1558 | 1.7783 | 0.0164 | 0.0259 | 0.52% |
| 4 | 3.7427 | 3.7541 | 2.2914 | 0.0114 | 0.1029 | 0.30% |
| 5 | 2.5826 | 2.5527 | 1.8882 | 0.0299 | 0.0460 | -1.17% |
| 6 | 1.3643 | 1.3344 | 0.8148 | 0.0298 | 0.0107 | -2.24% |
| 7 | 7.0085 | 7.0341 | 2.2660 | 0.0256 | 0.0237 | 0.36% |
| 8 | 4.2376 | 4.2144 | 4.3655 | 0.0232 | 0.0257 | -0.55% |
| 9 | 24.7027 | 25.0389 | 5.9760 | 0.3362 | 0.0109 | 1.34% |
| 10 | 3.3134 | 3.2412 | 1.5672 | 0.0722 | 0.0092 | -2.23% |
| 12 | 4.4131 | 4.4596 | 1.5045 | 0.0466 | 0.0181 | 1.04% |
| 13 | 6.5144 | 6.5674 | 2.5420 | 0.0530 | 0.0288 | 0.81% |
| 14 | 1.7937 | 1.7764 | 1.2860 | 0.0174 | 0.0147 | -0.98% |
| 16 | 0.6327 | 0.6412 | 0.4263 | 0.0085 | 0.0011 | 1.32% |
| 18 | 3.2392 | 3.2277 | 0.8330 | 0.0116 | 0.0023 | -0.36% |
| 19 | 1.7821 | 1.8034 | 1.3233 | 0.0213 | 0.0036 | 1.18% |
| 21 | 13.5405 | 13.8263 | 2.4237 | 0.2858 | 0.0015 | 2.07% |

(Chart image: comparison_chart_WITHOUT_ANALYZE)

HTML Chart:
comparison_chart_WITHOUT_ANALYZE.html

Motivation and context

Improving this great project!

Description of AI Usage

No direct use; slight use for comments and the final code review.

Benchmarking showed that read_varint is as fast as read_varint_fast for small inputs. The fast path in read_varint_fast becomes slower when the varint exceeds 2 bytes because it falls back and restarts the calculation.

Previous attempts to optimize it further using bit-manipulation tricks did not yield improvements. For now, the standard read_varint implementation is the best option. If a superior algorithm is identified in the future, it should replace read_varint directly.

@turso-bot (bot) left a comment


Please review @jussisaurio
