Refactor deserialization to support borrowed types #2648
Open
kryesh wants to merge 27 commits into quickwit-oss:main from kryesh:deserialize_ref2
…under the hood. Add `_raw` API to reader/searcher
… example implementations for `DocumentDeserialize` to existing types
…e `RefValue` to ensure no invalid state from objects or arrays
PSeitz-dd reviewed Jun 23, 2025
…aryDocumentDeserializer so it can be used in type definitions
…ak existing applications
Contributor
Can you add some details about the motivation of the PR?
Contributor, Author
Updated the top comment
This PR adds support for deserializing documents that contain borrowed types.
The primary motivation is performance when deserializing many documents at once, where allocations and deallocations for variable-length types such as `String` become a significant part of program runtime.
I implemented this for use in my own program, "Crystalline", which reads many documents from Tantivy indices as part of search operations. Crystalline searches behave much like those of a SIEM solution with a piped query language, using Tantivy to perform keyword searches under the hood to reduce the number of retrieved documents. With the upstream version of Tantivy, it was able to read roughly 5GB/s of log data from a sample database. With the patches here, this increased to ~6GB/s on the same database. Having string values as references enabled further changes to Crystalline, such as using bump allocation to allocate memory for many documents at once. With these changes in place, it can now retrieve documents from the same dataset at ~7.5GB/s (8.4m docs/s) on the same hardware (a VM with 12 threads hosted on a Ryzen 7 7700X, reading from network storage).
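To make the allocation argument concrete, here is a minimal sketch of the difference between decoding a string into an owned `String` and decoding it as a borrowed `&str`. The wire format (a one-byte length prefix followed by UTF-8 bytes) and the function names `read_owned` and `read_borrowed` are assumptions for illustration only; they are not Tantivy's format or API.

```rust
/// Hypothetical wire format: one length byte followed by that many UTF-8 bytes.
/// Decodes into an owned `String`, which costs one heap allocation per value.
fn read_owned(buf: &[u8]) -> Option<(String, &[u8])> {
    let (&len, rest) = buf.split_first()?;
    let len = len as usize;
    if rest.len() < len {
        return None;
    }
    let (bytes, tail) = rest.split_at(len);
    // `to_owned` copies the bytes into a fresh allocation.
    let s = std::str::from_utf8(bytes).ok()?.to_owned();
    Some((s, tail))
}

/// Same format, but the returned `&str` points into `buf`, so nothing is
/// allocated or copied; the value simply cannot outlive the buffer.
fn read_borrowed(buf: &[u8]) -> Option<(&str, &[u8])> {
    let (&len, rest) = buf.split_first()?;
    let len = len as usize;
    if rest.len() < len {
        return None;
    }
    let (bytes, tail) = rest.split_at(len);
    let s = std::str::from_utf8(bytes).ok()?;
    Some((s, tail))
}
```

Both functions return the remaining input alongside the decoded value, so a caller can walk a buffer of many strings. With `read_borrowed`, the only per-value work is a bounds check and UTF-8 validation.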
Some relevant parts of Crystalline to show how I'm using the changes in this PR:
Custom document type with references:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/schema/event.rs#L40
Retrieving documents from a reader:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/bucket/reader.rs#L97
Deserializing batches of documents into a bump allocator:
https://codeberg.org/Kryesh/crystalline/src/commit/e64aa16593f2c34cd8b2ce9e52262afd81e28ca9/server/src/store/index/searcher.rs#L136
What's in this PR?
- `RefReader`, which wraps `OwnedBytes` and maintains a cursor without changing the internal slice like `OwnedBytes`'s read implementation.
- `DocumentDeserializeRef` trait for types that support being deserialized by reference
- `DocumentDeserialize`
- `BinaryRefDeserializable` trait for primitive types that support being deserialized from a `RefReader`
- `RefValue`
- `TryFrom<RefValue>` instead of visitors
- `OwnedValue` and `serde_json::Value`
- `BinaryValueDeserializer` with a simple function that returns the next value from a `RefReader`
- `Searcher` to expose the updated `BinaryDocumentDeserializer` struct publicly and a new `doc_raw` method to retrieve the `BinaryDocumentDeserializer` for a document rather than deserializing it directly
- `Arc` for the decompressed document allows users of the API to control when they will be impacted by the borrow checker rather than requiring a reference to be returned via `Searcher`
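As a rough illustration of how these pieces could fit together, the sketch below shows a cursor-based reader over a shared byte block, a function that returns the next borrowed value, and a `TryFrom` conversion in place of a visitor. Every name here (`CursorReader`, `Value`, `next_value`, `decode_all`) and the tag-based byte layout are hypothetical stand-ins; they are not the types or the binary format in this PR.

```rust
use std::convert::{TryFrom, TryInto};
use std::sync::Arc;

/// Hypothetical reader: the underlying slice is never re-sliced,
/// only the cursor position moves.
struct CursorReader<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> CursorReader<'a> {
    fn new(data: &'a [u8]) -> Self {
        Self { data, pos: 0 }
    }

    /// Returns the next `len` bytes, borrowed for the buffer's lifetime `'a`
    /// (not the reader's), so earlier values stay usable after later reads.
    fn take(&mut self, len: usize) -> Option<&'a [u8]> {
        let data: &'a [u8] = self.data;
        let end = self.pos.checked_add(len)?;
        let slice = data.get(self.pos..end)?;
        self.pos = end;
        Some(slice)
    }
}

/// Hypothetical borrowed value: strings point into the source buffer.
#[derive(Debug, PartialEq)]
enum Value<'a> {
    Str(&'a str),
    U64(u64),
}

/// Conversion via `TryFrom` rather than a visitor.
impl<'a> TryFrom<Value<'a>> for &'a str {
    type Error = &'static str;
    fn try_from(v: Value<'a>) -> Result<Self, Self::Error> {
        match v {
            Value::Str(s) => Ok(s),
            _ => Err("expected a string"),
        }
    }
}

/// Assumed layout: tag 0 = one length byte + UTF-8 bytes,
/// tag 1 = little-endian u64. Returns None at end of input or on bad data.
fn next_value<'a>(r: &mut CursorReader<'a>) -> Option<Value<'a>> {
    let tag = r.take(1)?[0];
    match tag {
        0 => {
            let len = r.take(1)?[0] as usize;
            let s = std::str::from_utf8(r.take(len)?).ok()?;
            Some(Value::Str(s))
        }
        1 => {
            let bytes: [u8; 8] = r.take(8)?.try_into().ok()?;
            Some(Value::U64(u64::from_le_bytes(bytes)))
        }
        _ => None,
    }
}

/// The caller owns the block through an `Arc` and decides how long the
/// borrowed values live; decoding stops at end of input or the first bad value.
fn decode_all(block: &Arc<[u8]>) -> Vec<Value<'_>> {
    let mut reader = CursorReader::new(block);
    let mut out = Vec::new();
    while let Some(v) = next_value(&mut reader) {
        out.push(v);
    }
    out
}
```

Tying the returned slices to the buffer lifetime rather than to the reader is what lets several decoded values be alive at once. Keeping the block behind an `Arc` held by the caller means the borrow only begins when the caller chooses to decode.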