Serialize Vamana index with SSD sector alignment per MSFT DiskANN format, generate quantized dataset for integration with DiskANN #846

Open
wants to merge 48 commits into
base: branch-25.08
Changes from 13 commits
8c650ab
Serialize Vamana index with SSD sector alignment
jamxia155 Feb 14, 2025
bad5463
(WIP) Quantizer file parser
jamxia155 Feb 27, 2025
ac4199f
Move helper function to anonymous namespace
jamxia155 Feb 28, 2025
4b6a4d4
Remove update to unused function
jamxia155 Feb 28, 2025
c931f87
DiskANN quantization
jamxia155 Mar 5, 2025
765ffcd
clang-format
jamxia155 Apr 2, 2025
d8b1101
Separate pq pivots and rotation matrix inputs
jamxia155 Apr 5, 2025
0bf35ee
Align output file names with DiskANN
jamxia155 Apr 5, 2025
d0aabc6
Update usage description
jamxia155 Apr 9, 2025
6e5d908
clang-format
jamxia155 Apr 25, 2025
90eaf18
Specify condition for using quantized dataset
jamxia155 Apr 26, 2025
42630fe
Address PR comments
jamxia155 May 12, 2025
38646cf
Merge remote-tracking branch 'upstream/branch-25.06' into jamxia_vama…
jamxia155 May 12, 2025
81cbd56
API refactor
jamxia155 May 20, 2025
62b1392
Undo unnecessary API change
jamxia155 May 20, 2025
44f479b
Address minor review comments
jamxia155 May 20, 2025
a40237e
Avoid data conversion conditionally
jamxia155 May 20, 2025
673c087
Merge remote-tracking branch 'upstream/branch-25.06' into jamxia_vama…
jamxia155 May 20, 2025
f78ebcb
clang-format
jamxia155 May 20, 2025
31a1ddb
Minor formatting change
jamxia155 May 20, 2025
b0feec1
Minor rename
jamxia155 May 20, 2025
912a651
Retry CI
jamxia155 May 20, 2025
40beb58
Remove unused header
jamxia155 May 22, 2025
f38fd45
Address review comments
jamxia155 May 24, 2025
9118ad2
Merge remote-tracking branch 'upstream/branch-25.06' into jamxia_vama…
jamxia155 May 24, 2025
957fee1
Check file creation
jamxia155 May 24, 2025
5b1bc2e
Check that output files exist and are non-empty
jamxia155 May 24, 2025
bc1c546
Switch to randomized codebooks (WIP)
jamxia155 May 29, 2025
7ba37ab
Merge branch 'rapidsai:branch-25.08' into jamxia_vamana_serialize_qua…
jamxia155 May 30, 2025
7901903
Fetch codebooks from url
jamxia155 May 30, 2025
ce7b0d6
Install test files
jamxia155 Jun 2, 2025
f4b3d9d
Update test file paths
jamxia155 Jun 9, 2025
ed1e90d
Merge remote-tracking branch 'upstream/branch-25.08' into jamxia_vama…
jamxia155 Jun 9, 2025
6a36cfc
Fix path
jamxia155 Jun 9, 2025
a716f7d
Add quotes
jamxia155 Jun 9, 2025
8464d13
Fix if condition
jamxia155 Jun 9, 2025
868e6cb
Add script for fetching baseline files from URL
jamxia155 Jun 10, 2025
910b465
Download test data files in cmake
jamxia155 Jun 10, 2025
c8660f2
Merge remote-tracking branch 'upstream/branch-25.08' into jamxia_vama…
jamxia155 Jun 10, 2025
e6078db
Fix path
jamxia155 Jun 11, 2025
3fec92e
Reduce verbosity
jamxia155 Jun 11, 2025
57d8554
Enable reproducing CI locally
jamxia155 Jun 11, 2025
950e5ca
Merge branch 'branch-25.08' into jamxia_vamana_serialize_quantize_build
cjnolet Jun 12, 2025
7a92b2e
Retry CI
jamxia155 Jun 12, 2025
26feaaf
Merge remote-tracking branch 'jamxia_cuvs/jamxia_vamana_serialize_qua…
jamxia155 Jun 12, 2025
d36f91c
Update copyright years
jamxia155 Jun 13, 2025
e84cbf0
Update copyright year
jamxia155 Jun 13, 2025
111b7c7
Retry CI
jamxia155 Jun 13, 2025
19 changes: 14 additions & 5 deletions cpp/include/cuvs/neighbors/common.hpp
@@ -206,10 +206,14 @@ inline constexpr bool is_strided_dataset_v = is_strided_dataset<DatasetT>::value
* @param[in] res raft resources handle
* @param[in] src the source mdarray or mdspan
* @param[in] required_stride the leading dimension (in elements)
+ * @param[in] force_ownership force an owning_dataset to be returned (default: false)
* @return maybe owning current-device-accessible strided matrix
*/
template <typename SrcT>
- auto make_strided_dataset(const raft::resources& res, const SrcT& src, uint32_t required_stride)
+ auto make_strided_dataset(const raft::resources& res,
+ const SrcT& src,
+ uint32_t required_stride,
+ bool force_ownership = false)
Contributor

I am not opposed to this flag, but what is the reasoning behind the force_ownership flag? From my understanding, if the stride matches and the ptr is device accessible it should be okay to make it non-owning. And it is the calling process's responsibility to ensure the dataset is live until the end if the strided dataset is non-owning.

Author

As you might've suspected, setting this flag is only needed in specific situations like this: during vamana::index::index(), the function optionally computes a matrix for the quantized vectors using additional input files pointed to by index_params::codebook_prefix. Since this matrix is created within index(), its lifetime is limited to the index build (as opposed to the full dataset matrix which is provided to index() as an input) so we need to force ownership of the matrix. It may be possible to improve this by somehow moving the data instead of copying but that is not the focus for now.

Member

I agree with @tarang-jain here. I think there's a cleaner way to do this. Please don't treat public APIs like this as a "just need to get this done". These APIs tend to stick around for a long time and we need to make sure they are as clean and flexible as they can be. In fact, we often spend more time discussing our public APIs than we do the impl details because once these are exposed they are very hard to change.

Contributor

We have talked a bit about how to handle this. Without the ability to force ownership here, we would need to have the quantized_data device matrix as a member of the index and create/destroy it during the vamana::index constructor/destructor. Does this seem like a reasonable solution to everyone?

Contributor

That seems like a reasonable solution to me.

Author

jamxia155 May 20, 2025

Updated as discussed.
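
The ownership question in this thread can be illustrated with a minimal standalone sketch. It does not use RAFT or cuVS: `dataset_base`, `non_owning`, `owning`, and `make_dataset` are hypothetical stand-ins for `strided_dataset`, `non_owning_dataset`, `owning_dataset`, and `make_strided_dataset`, and a host `std::vector` stands in for device memory. It only mirrors the branch shown in the diff: return a view when the layout already matches, unless the caller forces a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for strided_dataset: a common base for both flavors.
struct dataset_base {
  virtual ~dataset_base() = default;
  virtual const float* data() const = 0;
  virtual bool is_owning() const    = 0;
};

// Hypothetical stand-in for non_owning_dataset: keeps only a pointer, so the
// caller must keep the source buffer alive.
struct non_owning : dataset_base {
  const float* ptr;
  explicit non_owning(const float* p) : ptr(p) {}
  const float* data() const override { return ptr; }
  bool is_owning() const override { return false; }
};

// Hypothetical stand-in for owning_dataset: copies the data and manages it.
struct owning : dataset_base {
  std::vector<float> storage;
  owning(const float* p, std::size_t n) : storage(p, p + n) {}
  const float* data() const override { return storage.data(); }
  bool is_owning() const override { return true; }
};

// Same decision as in the diff: a view is returned only when the layout
// already matches AND ownership is not forced; otherwise the data is copied.
std::unique_ptr<dataset_base> make_dataset(const float* src,
                                           std::size_t n,
                                           bool layout_matches,
                                           bool force_ownership = false)
{
  assert(src != nullptr || n == 0);
  if (layout_matches && !force_ownership) { return std::make_unique<non_owning>(src); }
  return std::make_unique<owning>(src, n);
}
```

With `force_ownership = true` the returned object stays valid after the source buffer is destroyed, which is the situation the author describes for a matrix created inside the index build. The alternative agreed on in the thread (storing the matrix as an index member) reaches the same lifetime guarantee without adding a flag to the public helper.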

-> std::unique_ptr<strided_dataset<typename SrcT::value_type, typename SrcT::index_type>>
{
using extents_type = typename SrcT::extents_type;
@@ -231,13 +235,14 @@ auto make_strided_dataset(const raft::resources& res, const SrcT& src, uint32_t
const bool row_major = src.stride(1) <= 1;
const bool stride_matches = required_stride == src_stride;

- if (device_accessible && row_major && stride_matches) {
+ if (device_accessible && row_major && stride_matches && !force_ownership) {
// Everything matches: make a non-owning dataset
return std::make_unique<non_owning_dataset<value_type, index_type>>(
raft::make_device_strided_matrix_view<const value_type, index_type>(
device_ptr, src.extent(0), src.extent(1), required_stride));
}
- // Something is wrong: have to make a copy and produce an owning dataset
+ // Something is wrong (or force_ownership = true): have to make a copy and produce an owning
+ // dataset
auto out_layout =
raft::make_strided_layout(src.extents(), std::array<index_type, 2>{required_stride, 1});
auto out_array =
@@ -347,18 +352,22 @@ auto make_strided_dataset(
* @param[in] res raft resources handle
* @param[in] src the source mdarray or mdspan
* @param[in] align_bytes the required byte alignment for the dataset rows.
+ * @param[in] force_ownership force an owning_dataset to be returned (default: false)
* @return maybe owning current-device-accessible strided matrix
*/
template <typename SrcT>
- auto make_aligned_dataset(const raft::resources& res, SrcT src, uint32_t align_bytes = 16)
+ auto make_aligned_dataset(const raft::resources& res,
+ SrcT src,
+ uint32_t align_bytes = 16,
+ bool force_ownership = false)
-> std::unique_ptr<strided_dataset<typename SrcT::value_type, typename SrcT::index_type>>
{
using source_type = std::remove_cv_t<std::remove_reference_t<SrcT>>;
using value_type = typename source_type::value_type;
constexpr size_t kSize = sizeof(value_type);
uint32_t required_stride =
raft::round_up_safe<size_t>(src.extent(1) * kSize, std::lcm(align_bytes, kSize)) / kSize;
- return make_strided_dataset(res, std::forward<SrcT>(src), required_stride);
+ return make_strided_dataset(res, std::forward<SrcT>(src), required_stride, force_ownership);
}
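
The stride arithmetic in `make_aligned_dataset` can be checked in isolation. The sketch below is a standalone approximation, not the cuVS code: `round_up` is a plain stand-in for `raft::round_up_safe`, and `aligned_stride` is a hypothetical helper name. It computes the row stride, in elements, such that every row starts on an `align_bytes` boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>  // std::lcm

// Round `bytes` up to the next multiple of `multiple`
// (plain stand-in for raft::round_up_safe).
constexpr std::size_t round_up(std::size_t bytes, std::size_t multiple)
{
  return ((bytes + multiple - 1) / multiple) * multiple;
}

// Hypothetical helper: stride in elements for rows of `row_len` elements of
// type T, so that each row begins at a multiple of `align_bytes`.
template <typename T>
constexpr std::uint32_t aligned_stride(std::size_t row_len, std::uint32_t align_bytes = 16)
{
  assert(align_bytes > 0);
  constexpr std::size_t kSize = sizeof(T);
  // The padded row size must be a multiple of both the alignment and the
  // element size, hence the least common multiple.
  const std::size_t unit = std::lcm(static_cast<std::size_t>(align_bytes), kSize);
  return static_cast<std::uint32_t>(round_up(row_len * kSize, unit) / kSize);
}
```

For example, 100 `float` values occupy 400 bytes, already a multiple of 16, so the stride stays 100. A row of 100 `uint8_t` codes is padded from 100 to 112 bytes, giving a stride of 112.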
/**
* @brief VPQ compressed dataset.
40 changes: 37 additions & 3 deletions cpp/include/cuvs/neighbors/vamana.hpp
@@ -72,6 +72,10 @@ struct index_params : cuvs::neighbors::index_params {
uint32_t queue_size = 127;
/** Max batchsize of reverse edge processing (reduces memory footprint) */
uint32_t reverse_batchsize = 1000000;
+ /** Path prefix to pq pivots and rotation matrix files. Expects pq pivots file at
+ * "${codebook_prefix}_pq_pivots.bin" and rotation matrix file at
+ * "${codebook_prefix}_pq_pivots.bin_rotation_matrix.bin". */
+ std::string codebook_prefix = "";
};
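
The file naming described in the `codebook_prefix` comment can be sketched as follows. `codebook_paths` and `make_codebook_paths` are hypothetical names introduced only for illustration; the two suffixes are taken verbatim from the doc comment above. The sketch assumes the caller has already checked that the prefix is non-empty.

```cpp
#include <cassert>
#include <string>

// Hypothetical pair of input paths derived from index_params::codebook_prefix.
struct codebook_paths {
  std::string pq_pivots;
  std::string rotation_matrix;
};

// Build both file names from the prefix, following the naming in the doc
// comment: the rotation matrix name extends the full pq pivots file name.
inline codebook_paths make_codebook_paths(const std::string& codebook_prefix)
{
  assert(!codebook_prefix.empty());
  const std::string pivots = codebook_prefix + "_pq_pivots.bin";
  return {pivots, pivots + "_rotation_matrix.bin"};
}
```

Note that the rotation matrix file name is appended to the complete pivots file name, including its `.bin` extension, so a prefix of `/data/sift` yields `/data/sift_pq_pivots.bin` and `/data/sift_pq_pivots.bin_rotation_matrix.bin`.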

/**
@@ -127,6 +131,13 @@ struct index : cuvs::neighbors::index {
return *dataset_;
}

+ /** Quantized dataset [size, codes_rowlen] */
+ [[nodiscard]] inline auto quantized_data() const -> const cuvs::neighbors::dataset<int64_t>&
+ {
Contributor

Consider using device_matrix_view or host_matrix_view as the output type, rather than a dataset type object.

Author

I wanted to model this function after vamana::index::data() so that we can reuse the code in the serializer. Is there a reason we shouldn't return dataset from quantized_data() while it's ok to do so from data()?

Contributor

The return type of vamana::index::data() might have been an oversight. Here's what I think. quantized_data() is an important public API on the index. As a new cuvs user, one should not have to familiarize themselves with obscure data types such as dataset. My understanding is that those are for internal use-cases or to be used by other rapids repos. mdspan return types are a lot more user-friendly / readable on the other hand.

Contributor

While I agree that it would be nice to let users avoid the dataset type, the data() API call by other algorithms also returns a dataset object, not a matrix. So, if we want to change how these API calls work, we would need to be consistent across CAGRA and other algorithms as well.

Contributor

Ah, I did not realize that data() returns a dataset object in CAGRA too. In that case vamana::index::data() should be fine. But maybe the quantized data can be returned as a matrix. That is just my opinion, but @cjnolet can help design the API.

Author

Updated to const device_matrix_view.
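
The direction both threads converge on (the index owns the quantized matrix as a member and exposes it through a lightweight view) might look roughly like the standalone sketch below. It is an illustration under stated assumptions, not the code in this PR: `matrix_view` is a hypothetical stand-in for `raft::device_matrix_view`, `toy_index` and `set_quantized` are invented names, and host memory stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical non-owning 2-D view (stand-in for raft::device_matrix_view).
struct matrix_view {
  const std::uint8_t* data;
  std::size_t rows;
  std::size_t cols;
};

class toy_index {
 public:
  // The index takes ownership by moving the codes into a member, so their
  // lifetime matches the lifetime of the index.
  void set_quantized(std::vector<std::uint8_t> codes, std::size_t rows, std::size_t cols)
  {
    assert(codes.size() == rows * cols);
    codes_ = std::move(codes);
    rows_  = rows;
    cols_  = cols;
  }

  // Returns a view instead of a dataset object. Throws when no quantized
  // data is present (the diff performs a similar check with RAFT_EXPECTS).
  matrix_view quantized_data() const
  {
    if (codes_.empty()) { throw std::logic_error("Invalid quantized dataset"); }
    return {codes_.data(), rows_, cols_};
  }

 private:
  std::vector<std::uint8_t> codes_;
  std::size_t rows_ = 0;
  std::size_t cols_ = 0;
};
```

The view is only valid while the index is alive and its quantized data is not replaced, which is the usual contract for view-returning accessors.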

+ RAFT_EXPECTS(quantized_dataset_, "Invalid quantized dataset");
+ return *quantized_dataset_;
+ }

/** vamana graph [size, graph-degree] */
[[nodiscard]] inline auto graph() const noexcept
-> raft::device_matrix_view<const IdxT, int64_t, raft::row_major>
@@ -212,11 +223,28 @@ struct index : cuvs::neighbors::index {
graph_view_ = graph_.view();
}

+ /**
+ * Replace the quantized dataset with a new dataset.
+ *
+ * If `force_ownership` is set, we create a copy of the quantized dataset on the device,
+ * and the index manages the lifetime of this copy.
+ * Otherwise, we store a reference to the device data in `new_quantized_dataset`, and it
+ * is the caller's responsibility to ensure that the data stays alive as long as the index.
+ */
+ void update_quantized_dataset(
+ raft::resources const& res,
+ raft::device_matrix_view<uint8_t, int64_t, raft::row_major> new_quantized_dataset,
+ bool force_ownership)
+ {
+ quantized_dataset_ = make_aligned_dataset(res, new_quantized_dataset, 16, force_ownership);
+ }

private:
cuvs::distance::DistanceType metric_;
raft::device_matrix<IdxT, int64_t, raft::row_major> graph_;
raft::device_matrix_view<const IdxT, int64_t, raft::row_major> graph_view_;
std::unique_ptr<neighbors::dataset<int64_t>> dataset_;
+ std::unique_ptr<neighbors::dataset<int64_t>> quantized_dataset_;
IdxT medoid_id_;
};
/**
@@ -457,13 +485,15 @@ auto build(raft::resources const& res,
* @param[in] file_prefix prefix of path and name of index files
* @param[in] index Vamana index
* @param[in] include_dataset whether or not to serialize the dataset
+ * @param[in] sector_aligned whether output file should be aligned to disk sectors of 4096 bytes
*
*/

void serialize(raft::resources const& handle,
const std::string& file_prefix,
const cuvs::neighbors::vamana::index<float, uint32_t>& index,
- bool include_dataset = true);
+ bool include_dataset = true,
+ bool sector_aligned = false);

/**
* Save the index to file.
@@ -486,12 +516,14 @@ void serialize(raft::resources const& handle,
* @param[in] file_prefix prefix of path and name of index files
* @param[in] index Vamana index
* @param[in] include_dataset whether or not to serialize the dataset
+ * @param[in] sector_aligned whether output file should be aligned to disk sectors of 4096 bytes
*
*/
void serialize(raft::resources const& handle,
const std::string& file_prefix,
const cuvs::neighbors::vamana::index<int8_t, uint32_t>& index,
- bool include_dataset = true);
+ bool include_dataset = true,
+ bool sector_aligned = false);

/**
* Save the index to file.
@@ -514,12 +546,14 @@ void serialize(raft::resources const& handle,
* @param[in] file_prefix prefix of path and name of index files
* @param[in] index Vamana index
* @param[in] include_dataset whether or not to serialize the dataset
+ * @param[in] sector_aligned whether output file should be aligned to disk sectors of 4096 bytes
*
*/
void serialize(raft::resources const& handle,
const std::string& file_prefix,
const cuvs::neighbors::vamana::index<uint8_t, uint32_t>& index,
- bool include_dataset = true);
+ bool include_dataset = true,
+ bool sector_aligned = false);
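
To give a feel for what aligning to 4096-byte sectors implies, here is a minimal arithmetic sketch. Only the sector size comes from the parameter description above. The node record layout (vector, then a 32-bit neighbor count, then up to `max_degree` 32-bit neighbor ids) and the rule that a record never straddles a sector boundary are assumptions about a DiskANN-style layout, and all function names are hypothetical. The serializer in this PR may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sector size taken from the `sector_aligned` parameter description.
constexpr std::size_t kSectorLen = 4096;

// Round a byte count up to a whole number of sectors.
constexpr std::size_t round_up_to_sector(std::size_t nbytes)
{
  return ((nbytes + kSectorLen - 1) / kSectorLen) * kSectorLen;
}

// ASSUMED record layout: vector data, one uint32 neighbor count, then
// max_degree uint32 neighbor ids.
constexpr std::size_t node_record_bytes(std::size_t dim,
                                        std::size_t elem_size,
                                        std::size_t max_degree)
{
  return dim * elem_size + sizeof(std::uint32_t) * (max_degree + 1);
}

// ASSUMED packing rule: as many whole records per sector as fit, with the
// remainder of each sector left as padding. Requires a record to fit in one
// sector.
inline std::size_t sectors_for_nodes(std::size_t n_nodes, std::size_t record_bytes)
{
  assert(record_bytes > 0 && record_bytes <= kSectorLen);
  const std::size_t per_sector = kSectorLen / record_bytes;
  return (n_nodes + per_sector - 1) / per_sector;
}
```

Under these assumptions, a 128-dimensional `float` vector with a maximum degree of 64 gives a 772-byte record, so five records fit in each sector and 1000 nodes occupy 200 sectors.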

/**
* @}