@kryesh
Contributor

@kryesh kryesh commented Jun 15, 2025

This PR adds support for deserializing documents that contain borrowed types.
The primary motivation for this is performance when deserializing many documents at once, where allocations and deallocations for variable-length types such as String become a significant part of program runtime.

I implemented this for my own program, "Crystalline", which reads many documents from Tantivy indices as part of search operations. Crystalline searches behave much like those of a SIEM solution with a piped query language, using Tantivy to perform keyword searches under the hood to reduce the number of retrieved documents. With the upstream version of Tantivy, it could read around 5GB/s of log data from a sample database. With the patches here, this increased to ~6GB/s on the same database. Having string values as references enabled further changes to Crystalline, such as using bump allocation to allocate memory for many documents at once. With these changes in place, it can now retrieve documents from the same dataset at ~7.5GB/s (8.4M docs/s) on the same hardware (a VM with 12 threads hosted on a Ryzen 7 7700X, reading from network storage).

Some relevant parts of Crystalline to show how I'm using the changes in this PR:
Custom document type with references:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/schema/event.rs#L40
Retrieving documents from a reader:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/bucket/reader.rs#L97
Deserializing batches of documents into a bump allocator:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/searcher.rs#L136

What's in this PR?

  • Added a new struct RefReader, which wraps OwnedBytes and maintains a cursor without modifying the internal slice, unlike OwnedBytes's read implementation, which does modify it.
  • Added a new DocumentDeserializeRef trait for types that support being deserialized by reference
    • Automatically implemented for types that implement the existing DocumentDeserialize
  • Added a new BinaryRefDeserializable trait for primitive types that support being deserialized from a RefReader
  • Refactored document/value deserialization away from the visitor pattern in favor of iteration over an enum RefValue
    • Document value deserialization is now done via TryFrom<RefValue> instead of visitors
    • Updated implementations for provided types such as OwnedValue and serde_json::Value
    • Replaced BinaryValueDeserializer with a simple function that returns the next value from a RefReader
    • Removed intermediate allocations and conversions when deserializing legacy objects
  • Updated the public API of Searcher: the updated BinaryDocumentDeserializer struct is now public, and a new doc_raw method returns the BinaryDocumentDeserializer for a document instead of deserializing it directly
    • Returning a struct that owns an Arc for the decompressed document lets users of the API control when they will be impacted by the borrow checker, rather than requiring Searcher to return a reference
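
To illustrate the general idea behind the list above, here is a minimal, standalone sketch of a cursor-based reader that hands out borrowed values, plus a TryFrom conversion in place of a visitor. It is not the code from this PR: `MiniRefReader`, `MiniRefValue`, `next_value`, and the tag-byte wire format are all hypothetical stand-ins invented for this example, and Tantivy's real types and binary format differ.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a RefReader-style type: a cursor over a byte
/// slice that advances `pos` and never modifies or shrinks `data`.
struct MiniRefReader<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> MiniRefReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    /// Returns the next `len` bytes, borrowed from the buffer (no copy).
    fn take(&mut self, len: usize) -> Option<&'a [u8]> {
        let end = self.pos.checked_add(len)?;
        let bytes = self.data.get(self.pos..end)?;
        self.pos = end;
        Some(bytes)
    }
}

/// Hypothetical stand-in for a RefValue-style enum: strings borrow from the
/// source buffer instead of being allocated as `String`.
#[derive(Debug, PartialEq)]
enum MiniRefValue<'a> {
    U64(u64),
    Str(&'a str),
}

/// Reads the next value using a made-up format (an assumption, not Tantivy's):
/// tag 0 = u64 as 8 little-endian bytes, tag 1 = 1-byte length + UTF-8 bytes.
fn next_value<'a>(reader: &mut MiniRefReader<'a>) -> Option<MiniRefValue<'a>> {
    let tag = reader.take(1)?[0];
    match tag {
        0 => {
            let raw = <[u8; 8]>::try_from(reader.take(8)?).ok()?;
            Some(MiniRefValue::U64(u64::from_le_bytes(raw)))
        }
        1 => {
            let len = reader.take(1)?[0] as usize;
            let text = std::str::from_utf8(reader.take(len)?).ok()?;
            Some(MiniRefValue::Str(text))
        }
        _ => None,
    }
}

/// Conversion via TryFrom rather than a visitor: the caller states the type
/// it wants and gets an error if the value has a different variant.
impl<'a> TryFrom<MiniRefValue<'a>> for &'a str {
    type Error = &'static str;

    fn try_from(value: MiniRefValue<'a>) -> Result<Self, Self::Error> {
        match value {
            MiniRefValue::Str(s) => Ok(s),
            _ => Err("expected a string value"),
        }
    }
}
```

In this sketch, a caller would create a `MiniRefReader` over a decoded buffer, call `next_value` in a loop, and convert each value with `<&str>::try_from(value)`. The returned `&str` points into the original buffer, so no per-string allocation takes place, which is the property the PR description relies on for its performance gains.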

@kryesh kryesh force-pushed the deserialize_ref2 branch from 579b2bb to 86fcfe2 Compare July 5, 2025 01:52
@kryesh kryesh marked this pull request as ready for review July 19, 2025 08:32
@kryesh kryesh requested a review from PSeitz-dd October 13, 2025 09:40
@PSeitz-dd
Contributor

Can you add some details about the motivation of the PR?

@kryesh
Contributor Author

kryesh commented Jan 1, 2026

> Can you add some details about the motivation of the PR?

Updated the top comment
