Description
My Goal:
Happy New Year!
I want to be able to persist CAGRA GPU indexes.
The standard Faiss workflow for this, as I understand it (please correct me if I am wrong), is the following; a sketch is shown after the list:
1. Build a GpuIndexCagra (GPU) and copy it to a CPU index.
2. write_index the CPU index to disk.
3. read_index the CPU index from disk.
4. Use the resulting IndexHNSWCagra (CPU), e.g. copy it back into a GpuIndexCagra.
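To make the intended round trip concrete, here is a minimal sketch of that workflow, assuming the copyTo/copyFrom API used in the reproduction script below; the file name is hypothetical and error handling is omitted:

#include <faiss/IndexHNSW.h>
#include <faiss/index_io.h>
#include <faiss/gpu/GpuIndexCagra.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <memory>

void persist_and_reload(faiss::gpu::GpuIndexCagra& trained_gpu_index, int d) {
    // 1. GPU -> CPU: copy the trained CAGRA graph into an IndexHNSWCagra
    //    (constructed the same way as in the reproduction script below).
    faiss::IndexHNSWCagra cpu_index(d, faiss::METRIC_L2);
    trained_gpu_index.copyTo(&cpu_index);

    // 2. Persist the CPU index ("cagra_hnsw.index" is a hypothetical path).
    faiss::write_index(&cpu_index, "cagra_hnsw.index");

    // 3. Reload it later as a CPU index.
    std::unique_ptr<faiss::Index> loaded(faiss::read_index("cagra_hnsw.index"));
    auto* reloaded = dynamic_cast<faiss::IndexHNSWCagra*>(loaded.get());

    // 4. CPU -> GPU: populate a fresh GpuIndexCagra from the reloaded index.
    faiss::gpu::StandardGpuResources res;
    faiss::gpu::GpuIndexCagraConfig config;
    config.device = 0;
    faiss::gpu::GpuIndexCagra new_gpu_index(&res, d, faiss::METRIC_L2, config);
    new_gpu_index.copyFrom(reloaded);
}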
Problem Description:
On DGX-Spark:
When cloning a CPU IndexHNSWCagra to a GPU CAGRA index using copyFrom/index_cpu_to_gpu, the resulting GPU index appears to keep a shallow dependency on the host memory of the source index. If the source CPU index is destroyed, subsequent searches on the GPU index fail with a CUDA illegal memory access. I also get "invalid device ordinal", even though the output shows the ordinal should be 0. I tried to follow the pattern index_cpu_to_gpu uses internally: GpuCloner.cpp#L236-L244 (sketched below).
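For reference, the equivalent clone through the generic cloner path (which internally constructs a GpuIndexCagra and calls copyFrom, per the GpuCloner.cpp lines linked above) would look roughly like this; a minimal sketch, assuming device 0 and default cloner options:

#include <faiss/IndexHNSW.h>
#include <faiss/gpu/GpuCloner.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <memory>

std::unique_ptr<faiss::Index> clone_to_gpu(
        faiss::gpu::StandardGpuResources& res,
        const faiss::IndexHNSWCagra* cpu_index) {
    // index_cpu_to_gpu dispatches to GpuIndexCagra::copyFrom for
    // IndexHNSWCagra inputs; the returned index is owned by the caller.
    return std::unique_ptr<faiss::Index>(
            faiss::gpu::index_cpu_to_gpu(&res, /*device=*/0, cpu_index));
}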
Interesting:
- On a different H100 machine the same script works and I do not get the problem above, which might hint at either a unified-memory issue or some misconfiguration; see the diagnostic sketch after this list.
- Also, if you reduce nb so that the DGX-Spark uses less memory, the crash does not happen.
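To probe the unified-memory hypothesis, one thing that can be compared between the two machines is what the driver reports about managed/pageable memory and stream-ordered allocation support (the failing call in the log below is cudaFreeAsync). A minimal diagnostic sketch using standard cudaDeviceProp fields:

#include <cuda_runtime.h>
#include <cstdio>

// Print the memory-model related properties of device 0 so the
// DGX-Spark (GB10) and H100 outputs can be compared side by side.
int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed\n");
        return 1;
    }
    std::printf("device                                 : %s\n", prop.name);
    std::printf("managedMemory                          : %d\n", prop.managedMemory);
    std::printf("concurrentManagedAccess                : %d\n", prop.concurrentManagedAccess);
    std::printf("pageableMemoryAccess                   : %d\n", prop.pageableMemoryAccess);
    std::printf("pageableMemoryAccessUsesHostPageTables : %d\n",
                prop.pageableMemoryAccessUsesHostPageTables);
    std::printf("memoryPoolsSupported (cudaFreeAsync)   : %d\n",
                prop.memoryPoolsSupported);
    return 0;
}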
Setup
Tested on both machines in a container with the following library versions:
- Faiss version: 1.13.0
- cuVS version: 25.08.0
- RAFT version: 25.08.0
- CUDA version: 12.8.93
Minimal script for reproducibility:
#include <cuda_runtime.h>
#include <faiss/gpu/GpuIndexCagra.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <faiss/IndexHNSW.h>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <vector>
void print_available_devices() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::cerr << "No CUDA devices found." << std::endl;
        return;
    }
    std::cout << "--- Available CUDA Devices ---" << std::endl;
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::cout << "Ordinal [" << i << "]: " << prop.name << " ("
                  << prop.totalGlobalMem / 1024 / 1024 << " MB)" << std::endl;
    }
}
void run_search(faiss::gpu::GpuIndexCagra* index, int d, const std::string& label) {
    std::vector<float> query(d, 0.5f);
    std::vector<float> dists(1);
    faiss::idx_t labels[1];
    std::cout << "[ SEARCH ] " << label << "..." << std::endl;
    index->search(1, query.data(), 1, dists.data(), labels);
    std::cout << "[ SUCCESS ] Result ID: " << labels[0] << std::endl;
}
int main() {
    print_available_devices();
    const int d = 128;
    const int nb = 100000;
    faiss::gpu::StandardGpuResources res;
    faiss::gpu::GpuIndexCagraConfig config;
    config.device = 0;

    // 1. Setup Data
    std::vector<float> data(nb * d);
    for (auto& val : data) val = (float)rand() / RAND_MAX;

    // 2. Build Initial GPU Index
    std::cout << "\n--- STEP 1: Building Initial Index ---" << std::endl;
    auto gpu_index = std::make_unique<faiss::gpu::GpuIndexCagra>(&res, d, faiss::METRIC_L2, config);
    gpu_index->train(nb, data.data());

    // 3. Move it to CPU (The "Source" of the copy)
    std::cout << "--- STEP 2: Moving to CPU Source ---" << std::endl;
    auto cpu_source = std::make_unique<faiss::IndexHNSWCagra>(d, faiss::METRIC_L2);
    gpu_index->copyTo(cpu_source.get());

    // 4. Populate a NEW GPU Index from that CPU Source
    std::cout << "--- STEP 3: Populating New GPU Index from CPU ---" << std::endl;
    auto target_gpu = std::make_unique<faiss::gpu::GpuIndexCagra>(&res, d, faiss::METRIC_L2, config);
    target_gpu->copyFrom(cpu_source.get());

    // 5. Verification: Search before and after destroying the CPU source
    run_search(target_gpu.get(), d, "Search (CPU Source ALIVE)");
    std::cout << "--- STEP 4: Destroying CPU Source ---" << std::endl;
    cpu_source.reset(); // This tests if 'target_gpu' has a shallow or deep copy
    try {
        run_search(target_gpu.get(), d, "Search (CPU Source DELETED)");
        std::cout << "\n[ RESULT ] SURVIVED: Target GPU index is autonomous." << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "\n[ RESULT ] CRASHED: " << e.what() << std::endl;
    }
    return 0;
}

Script output on DGX-Spark:
--- Available CUDA Devices ---
Ordinal [0]: NVIDIA GB10 (122572 MB)
--- STEP 1: Building Initial Index ---
[668652][10:11:16:199063][info ] optimizing graph
[668652][10:11:16:445967][info ] Graph optimized, creating index
--- STEP 2: Moving to CPU Source ---
--- STEP 3: Populating New GPU Index from CPU ---
[ SEARCH ] Search (CPU Source ALIVE)...
[ SUCCESS ] Result ID: 33311
--- STEP 4: Destroying CPU Source ---
[ SEARCH ] Search (CPU Source DELETED)...
[ RESULT ] CRASHED: transform: failed inside CUB: cudaErrorInvalidDevice: invalid device ordinal
CUDA call='cudaFreeAsync(ptr, stream)' at file=/tmp/conda-bld-output/bld/rattler-build_libcuvs/work/cpp/src/neighbors/detail/cagra/compute_distance.hpp line=259 failed with an illegal memory access was encountered
Faiss assertion 'err__ == cudaSuccess' failed in virtual faiss::gpu::StandardGpuResourcesImpl::~StandardGpuResourcesImpl() at /home/faiss/faiss/gpu/StandardGpuResources.cpp:141; details: CUDA error 700 an illegal memory access was encountered
Aborted (core dumped)
Output on H100:
--- Available CUDA Devices ---
Ordinal [0]: NVIDIA H100 NVL (95319 MB)
--- STEP 1: Building Initial Index ---
[ 11236][10:25:38:098496][info ] optimizing graph
[ 11236][10:25:38:310985][info ] Graph optimized, creating index
--- STEP 2: Moving to CPU Source ---
--- STEP 3: Populating New GPU Index from CPU ---
[ SEARCH ] Search (CPU Source ALIVE)...
[ SUCCESS ] Result ID: 57286
--- STEP 4: Destroying CPU Source ---
[ SEARCH ] Search (CPU Source DELETED)...
[ SUCCESS ] Result ID: 57286
[ RESULT ] SURVIVED: Target GPU index is autonomous.