@fereidani fereidani commented Dec 26, 2025

Description

This change improves op_column and ensure_parsed_upto, two performance-intensive functions, and removes multiple memory allocations.

The new implementation halves the memory allocations in these functions, simplifies the code, and removes many redundant checks.

(Screenshots: 2025-12-26 01-12-46, 2025-12-26 02-00-36)

Faster RecordCursor implementation

I found that the older RecordCursor grows multiple Vecs too many times.

The older algorithm used two separate vectors, one of which was oversized to hold the first offset. The first offset is already captured by head_size, so we don't need to keep it separately.

The new algorithm merges serial_types and offsets into a single Vec and relies on header_size instead of reserving an extra entry in the vector.

This halves the number of grow operations needed to resize the vector, which improves performance especially for longer, more complex queries over large numbers of records. A sketch of the merged layout is shown below.
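
A minimal sketch of the merged layout, assuming the cursor tracks SQLite serial types and the end offsets of parsed columns within the record payload. The type and field names here (RecordCursorSketch, columns) are illustrative, not the PR's actual definitions:

```rust
/// Sketch only: one vector of (serial_type, end_offset) pairs instead of
/// two parallel Vecs, with the first column's start derived from header_size.
struct RecordCursorSketch {
    /// One entry per parsed column: (serial type, offset where the column ends).
    /// A single push per column means the backing allocation grows half as often.
    columns: Vec<(u64, usize)>,
    /// Size of the record header; replaces the extra leading entry the old
    /// offsets vector reserved for the first column's start.
    header_size: usize,
}

impl RecordCursorSketch {
    fn new(header_size: usize) -> Self {
        Self { columns: Vec::new(), header_size }
    }

    /// Start offset of column `i`: the header size for the first column,
    /// otherwise the end offset recorded for the previous column.
    fn column_start(&self, i: usize) -> usize {
        if i == 0 {
            self.header_size
        } else {
            self.columns[i - 1].1
        }
    }
}
```
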

Faster read_varint_fast

I improved the read_varint_fast implementation by reading a single u32 from memory and then operating on it in registers; the for loop inside read_varint_fast gets unrolled by the compiler.

The new read_varint_fast is faster than the old one because, when the fast path fails, it continues from index 4 instead of restarting from index 0 as before.

Don't worry about the over-reads the new algorithm performs; it discards the extra bytes once it finds the correct value.
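
As a rough illustration of this style of fast path, here is a minimal sketch assuming SQLite-style varints (7 data bits per byte, high bit set means another byte follows). The function decode_varint_prefix and its exact shape are hypothetical, not the PR's actual read_varint_fast:

```rust
/// Sketch: decode a varint if it fits in the first 4 bytes of `buf`.
/// Returns (value, bytes consumed), or None so a caller can fall back to a
/// slow path that continues from index 4 instead of restarting from 0.
fn decode_varint_prefix(buf: &[u8]) -> Option<(u64, usize)> {
    // Load up to 4 bytes into one u32 so byte extraction below happens on a
    // register value instead of repeated memory loads.
    let mut chunk = [0u8; 4];
    let n = buf.len().min(4);
    chunk[..n].copy_from_slice(&buf[..n]);
    let word = u32::from_be_bytes(chunk);

    let mut value: u64 = 0;
    // Fixed, small bound: a good candidate for compiler unrolling.
    for i in 0..n {
        let byte = (word >> (24 - 8 * i)) as u8;
        value = (value << 7) | u64::from(byte & 0x7f);
        if byte & 0x80 == 0 {
            // Any bytes loaded past this point are simply discarded.
            return Some((value, i + 1));
        }
    }
    None
}
```
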

I ran many benchmarks and found that LLVM generates better code for read_varint than anything I wrote or that existed before. In fact, the original read_varint_fast and read_varint showed no difference for shorter inputs, and read_varint_fast was actually slower for varints longer than 2 bytes.

I argue that read_varint_fast should not be in the library; if we find a better algorithm, we should improve read_varint directly.

Improvements

My benchmarks show roughly a 1-6% gain:

TPC-H results (WITHOUT ANALYZE):

| Query | Limbo #11b1494 (s) | Limbo #4ae8887 (s) | SQLite (s) | Limbo Diff | SQLite Diff | PR GAIN% |
|------:|-------------------:|-------------------:|-----------:|-----------:|------------:|---------:|
| 1 | 6.5235 | 6.9162 | 2.7526 | 0.3928 | 0.0083 | 5.68% |
| 2 | 0.9891 | 0.9789 | 0.2763 | 0.0103 | 0.0594 | -1.05% |
| 3 | 3.1394 | 3.1558 | 1.7783 | 0.0164 | 0.0259 | 0.52% |
| 4 | 3.7427 | 3.7541 | 2.2914 | 0.0114 | 0.1029 | 0.30% |
| 5 | 2.5826 | 2.5527 | 1.8882 | 0.0299 | 0.0460 | -1.17% |
| 6 | 1.3643 | 1.3344 | 0.8148 | 0.0298 | 0.0107 | -2.24% |
| 7 | 7.0085 | 7.0341 | 2.2660 | 0.0256 | 0.0237 | 0.36% |
| 8 | 4.2376 | 4.2144 | 4.3655 | 0.0232 | 0.0257 | -0.55% |
| 9 | 24.7027 | 25.0389 | 5.9760 | 0.3362 | 0.0109 | 1.34% |
| 10 | 3.3134 | 3.2412 | 1.5672 | 0.0722 | 0.0092 | -2.23% |
| 12 | 4.4131 | 4.4596 | 1.5045 | 0.0466 | 0.0181 | 1.04% |
| 13 | 6.5144 | 6.5674 | 2.5420 | 0.0530 | 0.0288 | 0.81% |
| 14 | 1.7937 | 1.7764 | 1.2860 | 0.0174 | 0.0147 | -0.98% |
| 16 | 0.6327 | 0.6412 | 0.4263 | 0.0085 | 0.0011 | 1.32% |
| 18 | 3.2392 | 3.2277 | 0.8330 | 0.0116 | 0.0023 | -0.36% |
| 19 | 1.7821 | 1.8034 | 1.3233 | 0.0213 | 0.0036 | 1.18% |
| 21 | 13.5405 | 13.8263 | 2.4237 | 0.2858 | 0.0015 | 2.07% |

(Chart image: comparison_chart_WITHOUT_ANALYZE)

HTML Chart:
comparison_chart_WITHOUT_ANALYZE.html

Motivation and context

Improving this great project!

Description of AI Usage

No direct use; slight use for comments and the final code review.

Benchmarking showed that read_varint is as fast as read_varint_fast for small inputs. The fast path in read_varint_fast becomes slower when the varint exceeds 2 bytes because it falls back and restarts the calculation.

Previous attempts to optimize it further using bit-manipulation tricks did not yield improvements. For now, the standard read_varint implementation is the best option. If a superior algorithm is identified in the future, it should replace read_varint directly.

@turso-bot (bot) left a comment


Please review @jussisaurio
