Dynamic Batching #261
Conversation
…the time the batch is dispatched
…of the centralized batch submit counter
…among CPU threads
  return index.runner->search(res, params, queries, neighbors, distances); \
}

CUVS_INST_DYNAMIC_BATCHING_INDEX(float, uint32_t, cuvs::neighbors::cagra, index<float, uint32_t>);
It's really unfortunate that we'll need to instantiate these individually for each index type. For example, Vamana is not included here. Is there any way we can remove this constraint? Can we just tie this to the search_params super class?
I agree, and I see a couple of solutions to this.

One is to go with class-based polymorphism. We would have to make the search parameters (neighbors::search_params) and the index type (neighbors::index) virtual by adding a virtual destructor to each. We would also need a virtual clone() method, so we can copy the implementation's search parameters via the base class. This goes slightly against our initial design of keeping the search parameters a POD. It also means it would be dangerous to pass the search parameters struct to kernels (but I think we haven't been doing this so far). We would then also need to add a virtual search method to the index (and also dim(), which is currently used by the dynamic batching), which goes slightly against our initial design of having search/build functions as plain functions. With this approach there would be only one, non-templated dynamic_batching constructor taking the abstract upstream index and search parameters.
Another solution is to go with template-based polymorphism. I could define a template constructor in the public header file (similar to what I have in the dynamic_batching_test at the moment). It would take the index, search params, and the search function as template parameters, so that users can instantiate it on their side. I think I would have to slightly rework detail::batch_runner and expose at least one of its constructors in the public header for that. This obviously goes against our design of not instantiating anything on the user side (but it doesn't involve any CUDA-specific code and should be fast).
I think the template-based solution would be slightly less disruptive to cuVS in general, but I also think we should probably take this to a follow-on PR and a separate discussion for 25.02.
@@ -138,6 +138,10 @@ using list_data = ivf::list<list_spec, SizeT, ValueT, IdxT>;
 */
template <typename T, typename IdxT>
struct index : cuvs::neighbors::index {
  using index_params_type = ivf_flat::index_params;
Please also include the other index types (e.g. bfknn, vamana, etc...).
I've checked all available indexes and, unfortunately, I cannot add any more instances at this point:
- bfknn would be the best candidate, but its build/search methods do not adhere to the expected API (the index/search params are missing as function arguments).
- vamana and nn_descent do not have corresponding search functions, which are needed for instantiating the dynamic batching index.

We can adapt the code to allow bfknn, but it would require more changes. I suggest we postpone this for 25.02 together with #261 (comment)
I have reviewed the implementation details. Thanks Artem for the additional documentation; overall the code looks great.
To achieve high throughput and low latency, one has to watch out for intricate details of queuing and synchronization, which makes the implementation complex. I have left a few comments that request additional explanations and suggest potential refactoring to make the logic easier to follow.
const auto seq_id        = batch_queue_.head();
const auto commit_result = try_commit(seq_id, n_queries);
// The bool (busy or not) returned if no queries were committed:
if (std::holds_alternative<bool>(commit_result)) {
  // Pause if the system is busy
  // (otherwise the progress is guaranteed due to update of the head counter)
  if (std::get<bool>(commit_result)) { to_commit.wait(); }
  continue;  // Try to get a new batch token
}
I am unaware of the intricacies of how the work should be queued, but it seems that we are doing queue state management at multiple levels: head() checks the tail position and potentially waits; try_commit() checks batch_status and maybe commits, maybe not; and here in the loop we check the status, potentially wait, and try again.
To keep the code simple, it would be great if try_commit not just tried, but actually committed, by moving this logic there.
But if there is a good reason to organize the logic this way, that could also be fine; after all, this is an implementation detail.
// The interpretation of the token status depends on the current seq_order_id and a similar
// counter in the token. This is to prevent conflicts when too many parallel requests wrap
// over the whole ring buffer (batch_queue_t).
token_status = batch_queue::batch_status(batch_token_observed, seq_id);
// Busy status means the current thread is a whole ring buffer ahead of the token.
// The thread should wait for the rest of the system.
if (token_status == slot_state::kFullBusy || token_status == slot_state::kEmptyBusy) {
  return true;
}
// This branch checks if the token was recently filled or dispatched.
// This means the head counter of the ring buffer is slightly outdated.
if (token_status == slot_state::kEmptyPast || token_status == slot_state::kFullPast ||
    batch_token_observed.size_committed() >= max_batch_size_) {
  batch_queue_.pop(seq_id);
  return false;
}
batch_token_updated = batch_token_observed;
batch_token_updated.size_committed() =
  std::min(batch_token_observed.size_committed() + n_queries, max_batch_size_);
Does the user of the queue have to be aware of all the possible states? Can't we hide this as an implementation detail of the queue? In other words, could we have a head() function which simply returns a valid slot, and move these state comparison details into the queue?
I think you're right, we can do this, but there have been two things preventing me from doing that so far:
- The queue is not aware of max_batch_size, but it's needed for the commit logic.
- We still need the states when we're waiting for the IO buffer after committing to the batch (see "Note: waiting for batch IO buffers").

Neither of the two seems to be a complete blocker, though.
local_waiter till_full{std::chrono::nanoseconds(size_t(params.dispatch_timeout_ms * 1e5)),
                       batch_queue_.niceness(seq_id)};
while (batch_queue::batch_status(batch_token_observed, seq_id) != slot_state::kFull) {
  /* Note: waiting for batch IO buffers
  The CPU threads can commit to the incoming batches in the queue in advance (this happens in
  try_commit).
  In this loop, a thread waits for the batch IO buffer to be released by a running search on
  the GPU side (scatter_outputs kernel). Hence, this loop is engaged only if all buffers are
  currently used, which suggests that the GPU is busy (or there are not enough IO buffers).
  This also means the current search is not likely to meet the deadline set by the user.

  The scatter kernel returns its buffer id into an acquired slot in the batch queue; in this
  loop we wait for that id to arrive.

  Generally, we want to waste as few CPU cycles as possible here, to let other threads wait
  on dispatch_sequence_id_ref below more efficiently. At the same time, we shouldn't use
  `.wait()` here, because `.notify_all()` would have to come from the GPU.
  */
  till_full.wait();
  batch_token_observed = batch_token_ref.load(cuda::std::memory_order_acquire);
}
Can this be moved to a helper function of batch_queue, to keep this state checking an internal detail of the queue?
Co-authored-by: Tamas Bela Feher <[email protected]>
Thanks Artem for the added documentation. The PR looks good to me. Please open an issue for the refactoring ideas that are feasible to implement as a follow-up.
/merge
Non-blocking / stream-ordered dynamic batching as a new index type.
API
This PR implements dynamic batching as a new index type, mirroring the API of other indices.
Feature: stream-ordered dynamic batching
Non-blocking / stream-ordered dynamic batching means the batching does not involve synchronizing with a GPU stream. Control is returned to the user as soon as the necessary work is submitted to the GPU. This entails a few good-to-know features:
Overall, stream-ordered dynamic batching makes it easy to modify existing cuVS indexes, because the wrapped index has the same execution behavior as the upstream index.
Work-in-progress TODO
- Public API (cpp/include/cuvs/neighbors/dynamic_batching.hpp) [ready for review CC @cjnolet]
- Implementation (cpp/src/neighbors/detail/dynamic_batching.cuh) [ready for preliminary review: requests for algorithm docstring/clarifications are especially welcome]