Dynamic Batching #261

achirkin · 2024-07-30T08:09:17Z

Non-blocking / stream-ordered dynamic batching as a new index type.

API

This PR implements dynamic batching as a new index type, mirroring the API of other indices.

[building is wrapping] Building the index means creating a lightweight wrapper on top of an existing index and initializing necessary components, such as IO batch buffers and synchronization primitives.
[type erasure] The underlying/upstream index type is erased once the dynamic_batching wrapper is created, i.e. there's no way to recover the original search index type or parameters.
[explicit control over batching] To allow multiple user requests group into a dynamic batch request, the users must use copies of the same dynamic batching index (the user-facing index type is a thin wrapper on top of a shared pointer, hence the copy is shallow and cheap). The search function is thread-safe.

Feature: stream-ordered dynamic batching

Non-blocking / stream-ordered dynamic batching means the batching does not involve synchronizing with a GPU stream. The control is returned to the user as soon as the necessary work is submitted to the GPU. This entails a few good-to-know features:

The dynamic batching index has the same blocking properties as the upstream index: if the upstream index does not involve stream sync during search, that the dynamic batching index does not involve it as well (otherwise, the dynamic batching search obviously waits till the upstream search synchronizes under the hood).
It's responsibility of the user to synchronize the stream before getting the results back - even if the upstream index search does not need it (the batch results are scattered back to the request threads in a post-processing kernel).
If the upstream index does not synchronize during search, the dynamic batching index can group the queries even in a single-threaded application (try it with --no-lap-sync option in the ann-bench benchmarks).

Overall, stream-ordered dynamic batching makes it easy to modify existing cuVS indexes, because the wrapped index has the same execution behavior as the upstream index.

Work-in-progress TODO

Add dynamic batching option to more indices in ann-bench
Add tests
(postponed to 25.02) Do proper benchmarking and possibly fine-tune the inter-thread communication
Review the API side (cpp/include/cuvs/neighbors/dynamic_batching.hpp) [ready for review CC @cjnolet]
Review the algorithm side (cpp/src/neighbors/detail/dynamic_batching.cuh) [ready for preliminary review: requests for algorithm docsting/clarifications are especially welcome]

…the time the batch is dispatched

…-batching

…lock

…of the centralized batch submit counter

…among CPU threads

…uffers)

achirkin · 2024-10-23T14:12:41Z

Sneak-peek into performance (single-query benchmarks on a workstation):

cjnolet · 2024-11-20T21:45:05Z

cpp/src/neighbors/dynamic_batching.cu

+    return index.runner->search(res, params, queries, neighbors, distances);       \
+  }
+
+CUVS_INST_DYNAMIC_BATCHING_INDEX(float, uint32_t, cuvs::neighbors::cagra, index<float, uint32_t>);


It's really unfortunate that we'll need to instantiate these individually for each index type. For example, Vamana is not included here. Is there any way we can remove this constaint? Can we just tie this to the search_params super class?

I agree and I see couple solutions to this.

One is to go with the class-based polymorphism.
Then we'd have to make the search parameters neighbors::search_params and the index type neighbohrs::index virtual, by adding the virtual destructor type. We will also need a virtual clone() method, so we can copy implementation search parameters via the base class. This goes slightly against our initial design of keeping the search parameters a POD. This also means it would be dangerous to pass search parameters struct to kernels (but I think we haven't been doing this so far).
Then we would also need to add virtual search method to the index (and also dim() which is currently used by the dynamic batching), which goes slightly against our initial design of having search/build functions as plain functions.
Then there will be only one, non-templated dynamic_batching constructor taking the abstract upstream index and search parameters.

Another solution is to go with the template-based polymorphism.
I could define a template constructor in the public header file (similar to what I have in the dynamic_batching_test at the moment). It would take the index, search params, and the search function as the template parameters, so that users can instantiate it on the user side. I think I would have to slightly rework the detail::batch_runner and expose at least one of its constructors in the public header for that.
This obviously goes against our design of not instantiating anything on the user side (but it doesn't involve any cuda-specific code and should be fast).

I think the template-based solution would be slightly less disruptive to cuVS in general, but I also probably we should take this to a follow-on PR and a separate discussion for 25.02

cpp/include/cuvs/neighbors/dynamic_batching.hpp

cjnolet · 2024-11-20T21:50:06Z

cpp/include/cuvs/neighbors/ivf_flat.hpp

@@ -138,6 +138,10 @@ using list_data = ivf::list<list_spec, SizeT, ValueT, IdxT>;
 */
 template <typename T, typename IdxT>
 struct index : cuvs::neighbors::index {
+  using index_params_type  = ivf_flat::index_params;


Please also include the other index types (e.g. bfknn, vamana, etc...).

I've checked all available indexes and, unfortunately, I cannot add any more instances at this point:

bfknn would be the best candidate, but its build/search methods do not adhere to the expected api (missing index/search params as function arguments).

vamana and nn_descend do not have corresponding search functions, which are needed for instantiating the dynamic batching index.

We can adapt the code to allow bfknn, but it would require more changes. I suggest we postpone this for 25.02 together with #261 (comment)

cpp/test/CMakeLists.txt

tfeher

I have reviewed the implementation details. Thanks Artem for the additional documentation, overall the code looks great.

To achieve high throughput and low latency, one has to watch out for intricate details for queuing and synchronization, which makes the implementation complex. I have left a few comments that requests additional explanation, and suggests potential refactoring to make the logic easier to follow.

cpp/bench/ann/src/cuvs/cuvs_ann_bench_param_parser.h