
Serialize Vamana index with SSD sector alignment per MSFT DiskANN format, generate quantized dataset for integration with DiskANN #846


Open: jamxia155 wants to merge 48 commits into base: branch-25.08

Conversation

jamxia155

(This supersedes PR #703)

Added an optional input flag to cuvs::neighbors::vamana::serialize to dump an input cuVS Vamana index to file with SSD sector alignment. The file format follows MSFT DiskANN.

When the sector-aligned option is used, the serializer also writes out the quantized dataset, computed from a user-supplied PQ codebooks file and rotation matrix file.


copy-pr-bot bot commented Apr 25, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@cjnolet
Member

cjnolet commented Apr 25, 2025

/ok to test d0aabc6

@jamxia155 jamxia155 marked this pull request as ready for review April 25, 2025 20:08
@jamxia155 jamxia155 requested a review from a team as a code owner April 25, 2025 20:08
auto make_strided_dataset(const raft::resources& res,
const SrcT& src,
uint32_t required_stride,
bool force_ownership = false)
Contributor

I am not opposed to this flag, but what is the reasoning behind force_ownership? From my understanding, if the stride matches and the pointer is device-accessible, it should be okay to make the dataset non-owning, and it is then the calling process's responsibility to keep the dataset alive for as long as the non-owning strided dataset is in use.

Author

As you might've suspected, setting this flag is only needed in specific situations like this one: during vamana::index::index(), the function optionally computes a matrix of quantized vectors from additional input files pointed to by index_params::codebook_prefix. Since this matrix is created within index(), its lifetime is limited to the index build (unlike the full dataset matrix, which is provided to index() as an input), so we need to force ownership of the matrix. It may be possible to improve this by moving the data instead of copying it, but that is not the focus for now.

Member

I agree with @tarang-jain here. I think there's a cleaner way to do this. Please don't treat public APIs like this as a "just need to get this done". These APIs tend to stick around for a long time and we need to make sure they are as clean and flexible as they can be. In fact, we often spend more time discussing our public APIs than we do the impl details because once these are exposed they are very hard to change.

Contributor

We have talked a bit about how to handle this, and without the ability to force ownership here, we would need to have the quantized_data device matrix as a member and create/destroy it during the vamana::index constructor/destructor. Does this seem like a reasonable solution to everyone?

Contributor

That seems like a reasonable solution to me.

Author

@jamxia155 jamxia155 May 20, 2025

Updated as discussed.

@@ -127,6 +131,13 @@ struct index : cuvs::neighbors::index {
return *dataset_;
}

/** Quantized dataset [size, codes_rowlen] */
[[nodiscard]] inline auto quantized_data() const -> const cuvs::neighbors::dataset<int64_t>&
{
Contributor

device_matrix_view or host_matrix_view as the output type, rather than a dataset type object.

Author

I wanted to model this function after vamana::index::data() so that we can reuse the code in the serializer. Is there a reason we shouldn't return dataset from quantized_data() while it's ok to do so from data()?

Contributor

The return type of vamana::index::data() might have been an oversight. Here's what I think. quantized_data() is an important public API on the index. As a new cuvs user, one should not have to familiarize themselves with obscure data types such as dataset. My understanding is that those are for internal use-cases or to be used by other rapids repos. mdspan return types are a lot more user-friendly / readable on the other hand.

Contributor

While I agree that it would be nice to let users avoid the dataset type, the data() API call by other algorithms also returns a dataset object, not a matrix. So, if we want to change how these API calls work, we would need to be consistent across CAGRA and other algorithms as well.

Contributor

Ah I did not realize that data() returns a dataset object in CAGRA too. In that case vamana::index::data() should be fine. But maybe the quantized data can be returned as a matrix. That is just my opinion, but @cjnolet can help design the API.

Author

Updated to const device_matrix_view.


@@ -408,9 +533,117 @@ index<T, IdxT> build(
batched_insert_vamana<T, float, IdxT, Accessor>(
res, params, dataset, vamana_graph.view(), &medoid_id, metric);

std::optional<raft::device_matrix<uint8_t, int64_t>> quantized_vectors;
if (params.codebook_prefix.size()) {
cuvs::neighbors::vpq_params pq_params;
Contributor

Question for @cjnolet - is the vpq API changing with 25.06 or in the future with the release of the new quantization API? If so, is CAGRA-Q having to change (which means this would have to change as well)?

*
*/
template <typename T, typename IdxT, typename HostMatT>
void serialize_sector_aligned(raft::resources const& res,
Contributor

The files created, their names, and how to use them with CPU DiskANN search is quite complex. Added an issue to add documentation for this as a future item: #906
Not a blocker for this PR, though.

@jamxia155 jamxia155 requested review from a team as code owners May 30, 2025 23:07
@jamxia155 jamxia155 requested a review from jameslamb May 30, 2025 23:07
@jamxia155 jamxia155 changed the base branch from branch-25.06 to branch-25.08 May 30, 2025 23:14
@github-actions github-actions bot removed the ci label Jun 2, 2025
@cjnolet
Member

cjnolet commented Jun 10, 2025

@jamxia155 I sincerely apologize for the delayed response here. We had some critical issues for release 25.06 that took several folks from the team over a week to resolve. Good news is that the release is now complete and we're available to help with timely reviews and guidance.

I notice the last several commits are related to the cmake download of the test files. These will need to work both locally for users, as well as in CI. Are they working locally for you?

@jamxia155
Author

jamxia155 commented Jun 10, 2025

I notice the last several commits are related to the cmake download of the test files. These will need to work both locally for users, as well as in CI. Are they working locally for you?

Hi @cjnolet, those changes worked locally for me but are not working in the CI yet. However, I have since received helpful guidance from @bdice on the pattern that's currently used in cuGraph and I think I have a good handle on what needs to be done now. Thanks.

@github-actions github-actions bot added the ci label Jun 10, 2025
ci/test_cpp.sh Outdated
mkdir -p "${RAPIDS_DATASET_ROOT_DIR}"
export RAPIDS_DATASET_ROOT_DIR
pushd "${RAPIDS_DATASET_ROOT_DIR}"
${GITHUB_WORKSPACE}/cpp/tests/get_test_data.sh --NEIGHBORS_ANN_VAMANA_TEST
Contributor

@bdice bdice Jun 11, 2025

This will only work in GitHub Actions and will fail when reproducing CI locally. We need a solution that doesn't involve ${GITHUB_WORKSPACE}.

I would make the get_test_data.sh script aware of ${RAPIDS_DATASET_ROOT_DIR}. That way it has a default location in which it downloads, and you can continue to call that script from the repository root with ./cpp/tests/get_test_data.sh. Avoid changing directories if you can, just to keep this script cleaner.

Also I would consider moving get_test_data.sh to a different folder. cuGraph uses a datasets/ directory to manage these scripts, which is good. https://github.com/rapidsai/cugraph/tree/branch-25.08/datasets Otherwise you could put it in ci/.

Author

Avoided ${GITHUB_WORKSPACE}, moved get_test_data.sh to ci/ (there doesn't seem to be a more fitting existing location), and let get_test_data.sh read the download path from ${RAPIDS_DATASET_ROOT_DIR} when it is defined.
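A minimal sketch of that pattern, with the downloader itself left as a placeholder comment (the echo and the example filename are illustrative, not the actual ci/get_test_data.sh): default the target directory to ${RAPIDS_DATASET_ROOT_DIR} when set, and download into it without changing directories.

```shell
#!/usr/bin/env bash
# Illustrative sketch only, not the actual ci/get_test_data.sh.
set -euo pipefail

# Use RAPIDS_DATASET_ROOT_DIR when defined; fall back to a local directory.
DATA_DIR="${RAPIDS_DATASET_ROOT_DIR:-datasets}"
mkdir -p "${DATA_DIR}"

# Download directly into DATA_DIR instead of cd'ing into it, e.g.:
#   curl -fsSL -o "${DATA_DIR}/vamana_test.bin" "<dataset-url>"
echo "test data directory: ${DATA_DIR}"
```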

Labels
ci CMake cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change Python
6 participants