Serialize Vamana index with SSD sector alignment per MSFT DiskANN format, generate quantized dataset for integration with DiskANN #846
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test d0aabc6
auto make_strided_dataset(const raft::resources& res,
                          const SrcT& src,
                          uint32_t required_stride,
                          bool force_ownership = false)
I am not opposed to this flag, but what is the reasoning behind the force_ownership flag? From my understanding, if the stride matches and the pointer is device-accessible, it should be okay to make it non-owning, and it is the calling process's responsibility to ensure the dataset stays alive as long as the strided dataset is non-owning.
As you might've suspected, setting this flag is only needed in specific situations like this one: during vamana::index::index(), the function optionally computes a matrix for the quantized vectors using additional input files pointed to by index_params::codebook_prefix. Since this matrix is created within index(), its lifetime is limited to the index build (as opposed to the full dataset matrix, which is provided to index() as an input), so we need to force ownership of the matrix. It may be possible to improve this by somehow moving the data instead of copying it, but that is not the focus for now.
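To make the lifetime issue concrete, here is a minimal, self-contained sketch in plain C++ (no cuVS types; make_dataset_sketch and its fields are illustrative stand-ins for make_strided_dataset and the dataset wrapper, not the actual API):

// Illustrative sketch only: shows why a non-owning view of a matrix created
// inside index() would dangle, and why forcing ownership (a copy) avoids it.
#include <cstdint>
#include <memory>
#include <vector>

struct dataset_sketch {
  const uint8_t* view = nullptr;                 // non-owning pointer; caller must keep data alive
  std::shared_ptr<std::vector<uint8_t>> owned;   // owning copy; survives the caller's scope
};

dataset_sketch make_dataset_sketch(const std::vector<uint8_t>& src, bool force_ownership)
{
  if (force_ownership) {
    // Copy the data so the returned dataset stays valid after `src` is destroyed.
    auto copy = std::make_shared<std::vector<uint8_t>>(src);
    return {copy->data(), copy};
  }
  return {src.data(), nullptr};  // valid only while `src` is alive
}

dataset_sketch build_sketch()
{
  std::vector<uint8_t> quantized(1024);  // analogous to the quantized matrix created inside index()
  return make_dataset_sketch(quantized, /*force_ownership=*/true);
  // With force_ownership = false, the returned `view` would dangle as soon as we return.
}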
I agree with @tarang-jain here. I think there's a cleaner way to do this. Please don't treat public APIs like this as a "just need to get this done". These APIs tend to stick around for a long time and we need to make sure they are as clean and flexible as they can be. In fact, we often spend more time discussing our public APIs than we do the impl details because once these are exposed they are very hard to change.
We have talked a bit about how to handle this, and without the ability to force ownership here, we would need to have the quantized_data device matrix as a member and create/destroy it during the vamana::index constructor/destructor. Does this seem like a reasonable solution to everyone?
That seems like a reasonable solution to me.
Updated as discussed.
@@ -127,6 +131,13 @@ struct index : cuvs::neighbors::index {
    return *dataset_;
  }

  /** Quantized dataset [size, codes_rowlen] */
  [[nodiscard]] inline auto quantized_data() const -> const cuvs::neighbors::dataset<int64_t>&
  {
Consider using device_matrix_view or host_matrix_view as the output type, rather than a dataset type object.
I wanted to model this function after vamana::index::data() so that we can reuse the code in the serializer. Is there a reason we shouldn't return dataset from quantized_data() while it's ok to do so from data()?
The return type of vamana::index::data() might have been an oversight. Here's what I think: quantized_data() is an important public API on the index. As a new cuvs user, one should not have to become familiar with obscure data types such as dataset. My understanding is that those are for internal use cases or for use by other RAPIDS repos. mdspan return types, on the other hand, are a lot more user-friendly and readable.
While I agree that it would be nice to let users avoid the dataset type, the data() API call in other algorithms also returns a dataset object, not a matrix. So, if we want to change how these API calls work, we would need to be consistent across CAGRA and the other algorithms as well.
Ah, I did not realize that data() returns a dataset object in CAGRA too. In that case vamana::index::data() should be fine, but maybe the quantized data can be returned as a matrix. That is just my opinion, but @cjnolet can help design the API.
Updated to const device_matrix_view.
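For reference, a hedged sketch of what the updated accessor could look like; the uint8_t element type, the int64_t extents, and the quantized_dataset_ member name are assumptions for illustration, not the exact code merged in this PR:

// Sketch of the mdspan-returning accessor discussed above. Element type and
// member name are assumptions, not the merged code.
[[nodiscard]] inline auto quantized_data() const
  -> raft::device_matrix_view<const uint8_t, int64_t>
{
  return quantized_dataset_.view();
}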
@@ -408,9 +533,117 @@ index<T, IdxT> build(
  batched_insert_vamana<T, float, IdxT, Accessor>(
    res, params, dataset, vamana_graph.view(), &medoid_id, metric);

  std::optional<raft::device_matrix<uint8_t, int64_t>> quantized_vectors;
  if (params.codebook_prefix.size()) {
    cuvs::neighbors::vpq_params pq_params;
Question for @cjnolet: is the vpq API changing with 25.06, or in the future with the release of the new quantization API? If so, will CAGRA-Q have to change (which would mean this has to change as well)?
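For readers following the diff, a hedged sketch of how this optional quantization step might hang together; quantize_with_codebooks is a hypothetical helper standing in for whatever routine encodes the dataset with the user-supplied codebooks and rotation matrix:

// Hedged sketch of the optional quantization branch in build(); the helper
// quantize_with_codebooks() is hypothetical and only illustrates the control flow:
// when a codebook prefix is supplied, PQ-encode the dataset so the serializer can
// later write the codes alongside the sector-aligned index.
std::optional<raft::device_matrix<uint8_t, int64_t>> quantized_vectors;
if (params.codebook_prefix.size()) {
  cuvs::neighbors::vpq_params pq_params;       // PQ configuration (bits per code, PQ dims, ...)
  quantized_vectors = quantize_with_codebooks(  // hypothetical helper, not a cuVS API
    res, dataset, params.codebook_prefix, pq_params);
}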
 *
 */
template <typename T, typename IdxT, typename HostMatT>
void serialize_sector_aligned(raft::resources const& res,
The files created, their names, and how to use them with CPU DiskANN search are quite complex. Added an issue to add documentation for this as a future item: #906
Not a blocker for this PR, though.
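For context on what the sector-aligned format involves: in the MSFT DiskANN disk layout, each 4096-byte sector holds a whole number of node records (the vector followed by its adjacency list), the unused tail of each sector is left as padding, and a metadata sector precedes the data. Below is a minimal sketch of that arithmetic based on general DiskANN-format assumptions, not this PR's exact code:

// Sketch of DiskANN-style sector-aligned layout arithmetic. Assumes one record
// fits in a sector; the real format also handles multi-sector nodes and a
// leading metadata sector, which are omitted here.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kSectorLen = 4096;  // SSD sector size used by DiskANN

// One on-disk node record: the vector plus its adjacency list stored as
// [num_neighbors, neighbor_0, ..., neighbor_{max_degree-1}].
inline std::size_t node_record_bytes(std::size_t dim, std::size_t max_degree, std::size_t elem_size)
{
  return dim * elem_size + sizeof(uint32_t) * (1 + max_degree);
}

// Records never straddle a sector boundary, so only whole records go in a sector.
inline std::size_t nodes_per_sector(std::size_t record_bytes)
{
  return kSectorLen / record_bytes;
}

// Total data sectors needed for n_nodes records (rounding up), e.g. for sizing the file.
inline std::size_t data_sectors(std::size_t n_nodes, std::size_t record_bytes)
{
  const std::size_t per_sector = nodes_per_sector(record_bytes);
  return (n_nodes + per_sector - 1) / per_sector;
}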
@jamxia155 I sincerely apologize for the delayed response here. We had some critical issues for release 25.06 that took several folks from the team over a week to resolve. The good news is that the release is now complete and we're available to help with timely reviews and guidance. I notice the last several commits are related to the cmake download of the test files. These will need to work both locally for users and in CI. Are they working locally for you?
Hi @cjnolet, those changes worked locally for me but are not working in CI yet. However, I have since received helpful guidance from @bdice on the pattern that's currently used in cuGraph, and I think I have a good handle on what needs to be done now. Thanks.
ci/test_cpp.sh
mkdir -p "${RAPIDS_DATASET_ROOT_DIR}"
export RAPIDS_DATASET_ROOT_DIR
pushd "${RAPIDS_DATASET_ROOT_DIR}"
${GITHUB_WORKSPACE}/cpp/tests/get_test_data.sh --NEIGHBORS_ANN_VAMANA_TEST
This will only work in GitHub Actions and will fail when reproducing CI locally. We need a solution that doesn't involve ${GITHUB_WORKSPACE}.
I would make the get_test_data.sh script aware of ${RAPIDS_DATASET_ROOT_DIR}. That way it has a default location in which it downloads, and you can continue to call that script from the repository root with ./cpp/tests/get_test_data.sh. Avoid changing directories if you can, just to keep this script cleaner.
Also, I would consider moving get_test_data.sh to a different folder. cuGraph uses a datasets/ directory to manage these scripts, which is good: https://github.com/rapidsai/cugraph/tree/branch-25.08/datasets. Otherwise you could put it in ci/.
Avoided ${GITHUB_WORKSPACE}, moved get_test_data.sh to ci/ (there doesn't seem to be an existing location that's more fitting), and let get_test_data.sh get the download path from ${RAPIDS_DATASET_ROOT_DIR} if it is defined.
(This supersedes PR #703.)
Added an optional input flag to cuvs::neighbors::vamana::serialize to dump an input cuvs Vamana index to file with SSD sector alignment. The file format follows MSFT DiskANN. Using the sector-aligned option also writes out the quantized dataset, computed using the user-supplied PQ codebook file and rotation matrix file.
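A hypothetical usage sketch follows; the sector_aligned parameter name and this exact overload are assumptions for illustration, not the authoritative serialize signature:

// Hypothetical usage sketch: the `sector_aligned` flag name and this overload are
// assumptions; check the vamana::serialize API documentation for the merged form.
#include <cuvs/neighbors/vamana.hpp>
#include <raft/core/resources.hpp>

void save_for_diskann(const raft::resources& res,
                      const cuvs::neighbors::vamana::index<float, uint32_t>& idx)
{
  // Writes the graph and vectors in the MSFT DiskANN sector-aligned disk layout,
  // and also dumps the quantized dataset when PQ codebook files were supplied at build time.
  cuvs::neighbors::vamana::serialize(res, "my_index", idx, /*sector_aligned=*/true);
}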