[PERF] prefetch blocks for fulltext index writer #3370
HammadB
left a comment
I think @sanketkedia should weigh in on this before merge, as we have operators for prefetch and established patterns for it
I will ask
let num_block_cache_hits = self
    .num_block_cache_hits
    .load(std::sync::atomic::Ordering::Relaxed);
let num_block_cache_misses = self
Foyer already exposes the underlying hit rate and miss rate - i think this is redundant?
chroma/rust/cache/src/foyer.rs
Line 251 in 8f7a0ff
Also, if we are doing it, I think it should be a metric, not a trace
ah yep missed that
I think it's actually kinda useful having it on the trace; if all you have is a metric point with a low cache hit rate it could be pretty hard to debug
will leave out for now
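For context on the counters under discussion, here is a minimal sketch of per-operation hit/miss counting with relaxed atomics. Only the field names `num_block_cache_hits` and `num_block_cache_misses` come from the diff; the `BlockCacheStats` type and its methods are hypothetical and are not the code in this PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical container for the two counters shown in the diff.
#[derive(Default)]
pub struct BlockCacheStats {
    num_block_cache_hits: AtomicU64,
    num_block_cache_misses: AtomicU64,
}

impl BlockCacheStats {
    /// Record one cache lookup. Relaxed ordering is enough here because the
    /// counters are independent tallies and do not guard other memory.
    pub fn record(&self, hit: bool) {
        let counter = if hit {
            &self.num_block_cache_hits
        } else {
            &self.num_block_cache_misses
        };
        counter.fetch_add(1, Ordering::Relaxed);
    }

    /// Read both counters as (hits, misses).
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.num_block_cache_hits.load(Ordering::Relaxed),
            self.num_block_cache_misses.load(Ordering::Relaxed),
        )
    }

    /// Hit rate in [0, 1], or None if no lookups were recorded.
    pub fn hit_rate(&self) -> Option<f64> {
        let (hits, misses) = self.snapshot();
        let total = hits + misses;
        if total == 0 {
            None
        } else {
            Some(hits as f64 / total as f64)
        }
    }
}
```

Attaching a snapshot like this to a trace ties the hit rate to one specific request, which is the debugging benefit mentioned above; an aggregate metric cannot show that.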
let task = wrap(operator, input, ctx.receiver());
self.send(task, ctx).await;

let segments = self.get_all_segments().await.unwrap();
(Not putting this in the initial_tasks() method from the Orchestrator trait as that would require making initial_tasks() async and my understanding is that it's intended to be a static list of tasks. A cleaner solution would perhaps be adding a GetSegment operator and dispatching that from initial_tasks().)
Ok(())
}

pub async fn prefetch(&self, id: &uuid::Uuid) -> Result<usize, Box<dyn ChromaError>> {
this is a new top-level API, not attached to a reader or writer
I think this makes more sense semantically since we're not fetching the blocks for any particular reader or writer instance and it lets us avoid K & V generics
}

pub async fn prefetch(&self, id: &Uuid) -> Result<usize, ArrowBlockfileProviderPrefetchError> {
    let block_ids = self
Why not just fetch the root and then do this? The abstractions feel a bit off to me here.
Fetching the root requires K to be known, which would break a general-purpose prefetch operator that has no knowledge of the concrete K/V types for each file path.
ah that sucks, might be nice to comment that
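To illustrate the constraint discussed in this thread, here is a rough sketch of a prefetch entry point that needs no `K`/`V` type parameters. Everything in it is an assumption made for illustration: `Provider`, `PrefetchError`, the `u128` stand-ins for `Uuid`, and the idea that the block ids for a file can be listed without decoding typed keys. It also assumes the returned `usize` is the number of blocks requested, which the diff does not confirm.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::Mutex;

// Stand-ins for uuid::Uuid so the sketch only needs the standard library.
type FileId = u128;
type BlockId = u128;

#[derive(Debug, PartialEq)]
pub enum PrefetchError {
    FileNotFound(FileId),
}

/// Hypothetical provider. Note the absence of K/V generics: prefetching only
/// needs block ids, not the typed contents of each block.
pub struct Provider {
    block_ids_by_file: HashMap<FileId, Vec<BlockId>>,
    cache: Mutex<HashSet<BlockId>>,
}

impl Provider {
    pub fn new(block_ids_by_file: HashMap<FileId, Vec<BlockId>>) -> Self {
        Self {
            block_ids_by_file,
            cache: Mutex::new(HashSet::new()),
        }
    }

    /// Load every block of the file into the cache and return how many
    /// blocks were requested (assumed meaning of the usize).
    pub fn prefetch(&self, id: &FileId) -> Result<usize, PrefetchError> {
        let block_ids = self
            .block_ids_by_file
            .get(id)
            .ok_or(PrefetchError::FileNotFound(*id))?;
        let mut cache = self.cache.lock().unwrap();
        for block_id in block_ids {
            cache.insert(*block_id);
        }
        Ok(block_ids.len())
    }

    pub fn is_cached(&self, block_id: &BlockId) -> bool {
        self.cache.lock().unwrap().contains(block_id)
    }
}
```

Because the method is not generic, a single prefetch operator can call it for any file path without knowing which key and value types that file stores.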
Barring any fallout from my question above this makes sense to me. Thanks for the attention to quality.
## Description of changes
These changes were originally part of #3370 but decided to break them out as they're no longer needed for that PR. Makes the interface slightly more flexible and puts the burden of owned data on the caller.
## Test plan
*How are these changes tested?*
- [x] Tests pass locally with `pytest` for python, `yarn test` for js, `cargo test` for rust
## Documentation Changes
*Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the [docs repository](https://github.com/chroma-core/docs)?*
n/a

Description of changes
Prefetches blocks in parallel when writing the full text index rather than fetching the blocks sequentially on demand. Fetching in parallel saves us quite a bit of time, and having prefetch as a distinct operator also allows us to mark it as I/O.
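As a rough illustration of the sequential versus parallel difference described here, the sketch below uses scoped OS threads and a simulated fetch. It is not the PR's implementation: the actual code is async, and `fetch_block`, `fetch_sequential`, and `fetch_parallel` are hypothetical names, with the 20 ms sleep standing in for storage latency.

```rust
use std::thread;
use std::time::Duration;

/// Simulated block fetch; the sleep stands in for a storage round trip.
fn fetch_block(id: u32) -> Vec<u8> {
    thread::sleep(Duration::from_millis(20));
    id.to_le_bytes().to_vec()
}

/// On-demand style: each fetch waits for the previous one to finish,
/// so total latency grows linearly with the number of blocks.
pub fn fetch_sequential(ids: &[u32]) -> Vec<Vec<u8>> {
    ids.iter().map(|&id| fetch_block(id)).collect()
}

/// Prefetch style: start every fetch up front, then collect the results
/// in the original order. Latencies overlap instead of adding up.
pub fn fetch_parallel(ids: &[u32]) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = ids
            .iter()
            .map(|&id| s.spawn(move || fetch_block(id)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("fetch thread panicked"))
            .collect::<Vec<_>>()
    })
}
```

Both functions return the same blocks in the same order; only the wall-clock time differs, roughly one fetch latency for the parallel version versus one per block for the sequential one.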
Test plan
How are these changes tested?
Tests pass locally with pytest for python, yarn test for js, cargo test for rust
Tested by disabling the write-through cache and observing that the prefetch step pulled in the correct blocks.
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs repository?
n/a