Refactor deserialization to support borrowed types #2648

kryesh · 2025-06-15T00:41:53Z

This PR adds support for deserializing documents that contain borrowed types.
The primary motivation for this is performance when deserializing many documents at once, where allocations and deallocations for variable-length types such as String become a significant part of program runtime.
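
To illustrate the allocation difference described above, here is a minimal sketch using only the standard library. It is not Tantivy code: `decode_owned`, `decode_borrowed`, and `points_into` are hypothetical helpers, and a newline-separated buffer stands in for a serialized document.

```rust
/// Owned decoding: every field is copied into a freshly allocated `String`.
fn decode_owned(buf: &str) -> Vec<String> {
    buf.lines().map(|line| line.to_string()).collect()
}

/// Borrowed decoding: every field is a `&str` slice pointing into `buf`,
/// so the only allocation is the `Vec` that holds the slices.
fn decode_borrowed(buf: &str) -> Vec<&str> {
    buf.lines().collect()
}

/// Returns true if `s` lies entirely inside the memory backing `buf`.
fn points_into(buf: &str, s: &str) -> bool {
    let start = buf.as_ptr() as usize;
    let end = start + buf.len();
    let p = s.as_ptr() as usize;
    p >= start && p + s.len() <= end
}

fn main() {
    let buf = String::from("alpha\nbeta\ngamma");
    let owned = decode_owned(&buf);
    let borrowed = decode_borrowed(&buf);

    // Both decoders yield the same logical content.
    assert_eq!(owned, borrowed);

    // Borrowed fields reuse the input buffer; owned fields live elsewhere.
    assert!(borrowed.iter().all(|s| points_into(&buf, s)));
    assert!(!owned.iter().any(|s| points_into(&buf, s)));
    println!("decoded {} fields", borrowed.len());
}
```

The borrowed variant ties the lifetime of each field to the input buffer, which is the trade-off the rest of this PR is built around.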

I implemented this for use in my own program, "Crystalline", where I read many documents from Tantivy indices as part of search operations. Crystalline searches behave similarly to those of a SIEM solution with a piped query language, using Tantivy to perform keyword searches under the hood to reduce the number of retrieved documents. With the upstream version of Tantivy it was able to read roughly 5GB/s of log data from a sample database. With the patches here, this increased to ~6GB/s on the same database. Having string values as references enabled further changes to Crystalline, such as using bump allocation to allocate memory for many documents at once. With these changes in place it can now retrieve documents from the same dataset at ~7.5GB/s (8.4 million docs/s) on the same hardware (a VM with 12 threads hosted on a Ryzen 7 7700X, reading from network storage).

Some relevant parts of Crystalline to show how I'm using the changes in this PR:
Custom document type with references:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/schema/event.rs#L40
Retrieving documents from a reader:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/bucket/reader.rs#L97
Deserializing batches of documents into a bump allocator:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/searcher.rs#L136

What's in this PR?

- Added a new struct RefReader, which wraps OwnedBytes and maintains a cursor instead of modifying the internal slice the way OwnedBytes's read implementation does.
- Added a new DocumentDeserializeRef trait for types that support being deserialized by reference.
  - Automatically implemented for types that implement the existing DocumentDeserialize.
- Added a new BinaryRefDeserializable trait for primitive types that support being deserialized from a RefReader.
- Refactored document/value deserialization away from the visitor pattern in favor of iteration over an enum, RefValue.
  - Document value deserialization is now done via TryFrom<RefValue> instead of visitors.
  - Updated implementations for provided types such as OwnedValue and serde_json::Value.
  - Replaced BinaryValueDeserializer with a simple function that returns the next value from a RefReader.
  - Removed intermediate allocations and conversions when deserializing legacy objects.
- Updated the public API of Searcher to expose the updated BinaryDocumentDeserializer struct and added a new doc_raw method that returns the BinaryDocumentDeserializer for a document rather than deserializing it directly.
  - Returning a struct that owns an Arc for the decompressed document lets users of the API control when they are affected by the borrow checker, rather than requiring a reference to be returned via Searcher.
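
The pieces listed above can be sketched together in a small, self-contained toy. This is not the PR's actual code or Tantivy's binary format: `RawDoc`, `CursorReader`, `Value`, and `next_value` are hypothetical names, and the tag/length encoding is invented for illustration. It assumes an owner holding the bytes in an `Arc`, a reader that advances a cursor rather than shrinking its slice, and conversion out of a borrowed value enum through `TryFrom`.

```rust
use std::convert::TryFrom;
use std::sync::Arc;

/// Hypothetical owner of a decompressed document's bytes.
struct RawDoc {
    bytes: Arc<[u8]>,
}

impl RawDoc {
    /// The caller decides when to start borrowing from the document.
    fn reader(&self) -> CursorReader<'_> {
        CursorReader { data: &self.bytes, pos: 0 }
    }
}

/// Hypothetical reader: tracks a position and leaves `data` untouched,
/// so every slice it returns can borrow for the full lifetime `'a`.
struct CursorReader<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> CursorReader<'a> {
    /// Returns the next `len` bytes and advances the cursor.
    fn take(&mut self, len: usize) -> Option<&'a [u8]> {
        let end = self.pos.checked_add(len)?;
        let slice = self.data.get(self.pos..end)?;
        self.pos = end;
        Some(slice)
    }

    /// Moves the cursor back to the start so the data can be read again.
    fn reset(&mut self) {
        self.pos = 0;
    }
}

/// Hypothetical borrowed value enum for the toy format.
#[derive(Debug, PartialEq)]
enum Value<'a> {
    Str(&'a str),
    U64(u64),
}

/// Reads one value: tag 0 is a one-byte length plus UTF-8 bytes,
/// tag 1 is a little-endian u64. Returns None on end of input or bad data.
fn next_value<'a>(reader: &mut CursorReader<'a>) -> Option<Value<'a>> {
    match reader.take(1)?[0] {
        0 => {
            let len = reader.take(1)?[0] as usize;
            std::str::from_utf8(reader.take(len)?).ok().map(Value::Str)
        }
        1 => {
            let raw = <[u8; 8]>::try_from(reader.take(8)?).ok()?;
            Some(Value::U64(u64::from_le_bytes(raw)))
        }
        _ => None,
    }
}

/// Conversion out of the enum replaces a visitor callback.
impl<'a> TryFrom<Value<'a>> for &'a str {
    type Error = &'static str;

    fn try_from(value: Value<'a>) -> Result<Self, Self::Error> {
        match value {
            Value::Str(s) => Ok(s),
            _ => Err("expected a string value"),
        }
    }
}

fn main() {
    // Encode the string "hi" followed by the integer 7.
    let mut bytes = vec![0u8, 2, b'h', b'i', 1];
    bytes.extend_from_slice(&7u64.to_le_bytes());
    let doc = RawDoc { bytes: Arc::from(bytes) };

    let mut reader = doc.reader();
    let text = <&str>::try_from(next_value(&mut reader).unwrap()).unwrap();
    assert_eq!(text, "hi");
    assert_eq!(next_value(&mut reader), Some(Value::U64(7)));
    assert_eq!(next_value(&mut reader), None);

    // After a reset the same bytes can be decoded again without copying.
    reader.reset();
    assert_eq!(next_value(&mut reader), Some(Value::Str("hi")));
}
```

Because `take` returns slices tied to `'a` rather than to the `&mut self` borrow, values decoded earlier stay usable while the reader keeps advancing, which is the property a cursor-based design provides over a reader that consumes its own slice.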

PSeitz-dd · 2025-12-31T08:40:39Z

Can you add some details about the motivation of the PR?

kryesh · 2026-01-01T04:01:54Z

> Can you add some details about the motivation of the PR?

Updated the top comment

PSeitz-dd and others added 14 commits June 14, 2025 21:26

chore: Release

253a0c6

chore: Release

bb4f317

Add RefReader struct

28ef215

Refactor deserialization to use references for string and byte types …

c9c0db7

…under the hood. Add `_raw` API to reader/searcher

Simplify and consolidate deserialization logic via reference value type

0d91b0d

Add legacy JSON object support to RefValue

0a823f0

Fix merge resolve error

c4942f2

Update docs

e708bcc

Simplify DocumentDeserialize trait lifetimes

7eda239

Move RefValue to be with the rest of the deserialization logic. Add…

7326185

… example implementations for `DocumentDeserialize` to existing types

Make RefValue store a reference for PreTokenizedString instead of…

2687223

… allocating

Update docs

da72735

Update Deserializer iterator generics to only allow types that consum…

f4f035b

…e `RefValue` to ensure no invalid state from objects or arrays

Clean up HashMap document deserialization

0dee764

PSeitz-dd reviewed Jun 23, 2025

common/src/refread.rs

kryesh added 10 commits June 24, 2025 09:40

Cargo fmt

09152bb

Remove legacy methods from RefReader

244c708

Add missing trait to fix tests

de79957

Remove unused method from RefReader

2a33969

Allow dead code in MyCustomDocument doc test

d2cad8c

Add position reset to DocumentDeserializer

e99ea60

Update public API to remain compatible with existing code, expose Bin…

a3ec80e

…aryDocumentDeserializer so it can be used in type definitions

Update docs

00a71ef

Expose BinaryDocumentDeserializer instead of BinaryDocumentSerializer

a647ad6

Update refreader to also be support position reset

86fcfe2

kryesh force-pushed the deserialize_ref2 branch from 579b2bb to 86fcfe2 Compare July 5, 2025 01:52

kryesh added 2 commits July 16, 2025 13:16

Revert back to OwnedValue implementations for default maps to not bre…

5c71ce1

…ak existing applications

Add tests for RefReader

2a4b3f1

kryesh marked this pull request as ready for review July 19, 2025 08:32

Merge branch 'main' into deserialize_ref2

aa7ade6

kryesh requested a review from PSeitz-dd October 13, 2025 09:40
