
Serialize Vamana index with SSD sector alignment per MSFT DiskANN format, generate quantized dataset for integration with DiskANN #846


Open: jamxia155 wants to merge 48 commits into base: branch-25.08

Conversation

jamxia155

(This supersedes PR #703)

Added an optional input flag to cuvs::neighbors::vamana::serialize to dump an input cuVS Vamana index to file with SSD sector alignment. The file format follows MSFT DiskANN.

When the sector-aligned option is used, the serializer also writes out the quantized dataset, computed from a user-supplied PQ codebooks file and rotation matrix file.


copy-pr-bot bot commented Apr 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@cjnolet
Member

cjnolet commented Apr 25, 2025

/ok to test d0aabc6

@jamxia155 jamxia155 marked this pull request as ready for review April 25, 2025 20:08
@jamxia155 jamxia155 requested a review from a team as a code owner April 25, 2025 20:08
auto make_strided_dataset(const raft::resources& res,
const SrcT& src,
uint32_t required_stride,
bool force_ownership = false)
Contributor

I am not opposed to this flag, but what is the reasoning behind force_ownership? From my understanding, if the stride matches and the pointer is device-accessible, it should be okay to make the dataset non-owning, and it is then the calling process's responsibility to keep the dataset alive for as long as the non-owning strided dataset is in use.

Author

As you might've suspected, setting this flag is only needed in specific situations like this one: during vamana::index::index(), the function optionally computes a matrix of quantized vectors from additional input files pointed to by index_params::codebook_prefix. Since this matrix is created within index(), its lifetime is limited to the index build (unlike the full dataset matrix, which is provided to index() as an input), so we need to force ownership of the matrix. It may be possible to improve this by moving the data instead of copying it, but that is not the focus for now.

Member

I agree with @tarang-jain here. I think there's a cleaner way to do this. Please don't treat public APIs like this as a "just need to get this done". These APIs tend to stick around for a long time and we need to make sure they are as clean and flexible as they can be. In fact, we often spend more time discussing our public APIs than we do the impl details because once these are exposed they are very hard to change.

Contributor

We have talked a bit about how to handle this, and without the ability to force ownership here, we would need to have the quantized_data device matrix as a member and create/destroy it during the vamana::index constructor/destructor. Does this seem like a reasonable solution to everyone?

Contributor

That seems like a reasonable solution to me.

Author

@jamxia155 jamxia155 May 20, 2025

Updated as discussed.

@@ -127,6 +131,13 @@ struct index : cuvs::neighbors::index {
return *dataset_;
}

/** Quantized dataset [size, codes_rowlen] */
[[nodiscard]] inline auto quantized_data() const -> const cuvs::neighbors::dataset<int64_t>&
{
Contributor

device_matrix_view or host_matrix_view as the output type, rather than a dataset type object.

Author

I wanted to model this function after vamana::index::data() so that we can reuse the code in the serializer. Is there a reason we shouldn't return dataset from quantized_data() while it's ok to do so from data()?

Contributor

The return type of vamana::index::data() might have been an oversight. Here's what I think. quantized_data() is an important public API on the index. As a new cuvs user, one should not have to familiarize themselves with obscure data types such as dataset. My understanding is that those are for internal use-cases or to be used by other rapids repos. mdspan return types are a lot more user-friendly / readable on the other hand.

Contributor

While I agree that it would be nice to let users avoid the dataset type, the data() API call by other algorithms also returns a dataset object, not a matrix. So, if we want to change how these API calls work, we would need to be consistent across CAGRA and other algorithms as well.

Contributor

Ah I did not realize that data() returns a dataset object in CAGRA too. In that case vamana::index::data() should be fine. But maybe the quantized data can be returned as a matrix. That is just my opinion, but @cjnolet can help design the API.

Author

Updated to const device_matrix_view.


@@ -408,9 +533,117 @@ index<T, IdxT> build(
batched_insert_vamana<T, float, IdxT, Accessor>(
res, params, dataset, vamana_graph.view(), &medoid_id, metric);

std::optional<raft::device_matrix<uint8_t, int64_t>> quantized_vectors;
if (params.codebook_prefix.size()) {
cuvs::neighbors::vpq_params pq_params;
Contributor

Question for @cjnolet - is the vpq API changing with 25.06 or in the future with the release of the new quantization API? If so, is CAGRA-Q having to change (which means this would have to change as well)?

*
*/
template <typename T, typename IdxT, typename HostMatT>
void serialize_sector_aligned(raft::resources const& res,
Contributor

The files created, their names, and how to use them with CPU DiskANN search is quite complex. Added an issue to add documentation for this as a future item: #906
Not a blocker for this PR, though.

@jamxia155 jamxia155 requested review from a team as code owners May 30, 2025 23:07
@jamxia155 jamxia155 requested a review from jameslamb May 30, 2025 23:07
@jamxia155 jamxia155 changed the base branch from branch-25.06 to branch-25.08 May 30, 2025 23:14
@github-actions github-actions bot removed the ci label Jun 2, 2025
@cjnolet
Member

cjnolet commented Jun 10, 2025

@jamxia155 I sincerely apologize for the delayed response here. We had some critical issues for release 25.06 that took several folks from the team over a week to resolve. Good news is that the release is now complete and we're available to help with timely reviews and guidance.

I notice the last several commits are related to the cmake download of the test files. These will need to work both locally for users, as well as in CI. Are they working locally for you?

@jamxia155
Author

jamxia155 commented Jun 10, 2025

I notice the last several commits are related to the cmake download of the test files. These will need to work both locally for users, as well as in CI. Are they working locally for you?

Hi @cjnolet, those changes worked locally for me but are not working in the CI yet. However, I have since received helpful guidance from @bdice on the pattern that's currently used in cuGraph and I think I have a good handle on what needs to be done now. Thanks.

@github-actions github-actions bot added the ci label Jun 10, 2025
ci/test_cpp.sh Outdated
mkdir -p "${RAPIDS_DATASET_ROOT_DIR}"
export RAPIDS_DATASET_ROOT_DIR
pushd "${RAPIDS_DATASET_ROOT_DIR}"
${GITHUB_WORKSPACE}/cpp/tests/get_test_data.sh --NEIGHBORS_ANN_VAMANA_TEST
Contributor

@bdice bdice Jun 11, 2025

This will only work in GitHub Actions and will fail when reproducing CI locally. We need a solution that doesn't involve ${GITHUB_WORKSPACE}.

I would make the get_test_data.sh script aware of ${RAPIDS_DATASET_ROOT_DIR}. That way it has a default location in which it downloads, and you can continue to call that script from the repository root with ./cpp/tests/get_test_data.sh. Avoid changing directories if you can, just to keep this script cleaner.

Also I would consider moving get_test_data.sh to a different folder. cuGraph uses a datasets/ directory to manage these scripts, which is good. https://github.com/rapidsai/cugraph/tree/branch-25.08/datasets Otherwise you could put it in ci/.

Author

Avoided ${GITHUB_WORKSPACE}, moved get_test_data.sh to ci/ (there doesn't seem to be a more fitting existing location), and let get_test_data.sh read the download path from ${RAPIDS_DATASET_ROOT_DIR} when it is defined.
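A minimal sketch of that pattern, with the downloader itself left as a placeholder comment (the echo and the example filename are illustrative, not the actual ci/get_test_data.sh): default the target directory to ${RAPIDS_DATASET_ROOT_DIR} when set, and download into it without changing directories.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not the actual ci/get_test_data.sh.
set -euo pipefail

# Use RAPIDS_DATASET_ROOT_DIR when defined; fall back to a local directory.
DATA_DIR="${RAPIDS_DATASET_ROOT_DIR:-datasets}"
mkdir -p "${DATA_DIR}"

# Download directly into DATA_DIR instead of cd'ing into it, e.g.:
#   curl -fsSL -o "${DATA_DIR}/vamana_test.bin" "<dataset-url>"
echo "test data directory: ${DATA_DIR}"
```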

Labels
ci CMake cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change Python
6 participants