Description
Hi,
I am noticing that when converting a GPU index (built using CAGRA) to a CPU index, CPU memory spikes much higher for datasets with a high number of vectors but a low number of dimensions than for datasets with fewer vectors but more dimensions. For example, for a 40 million x 128 fp32 vector dataset, CPU memory spikes to almost ~60 GB, while for a 4 million x 1536 fp32 vector dataset it stays under ~50 GB. This is despite the 40 million x 128 dataset being the smaller of the two: at 4 bytes per fp32 value, the raw data is 19531.25 MiB, versus 23438 MiB for the 4 million x 1536 dataset. Why is the CPU memory used during the GPU-to-CPU conversion higher for the 40 million x 128 dataset even though the dataset itself is smaller? Is there a bug in the faiss code that is causing this spike, or is this expected?
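For context, the conversion step where I see the spike looks roughly like this. This is a minimal sketch, assuming the faiss-gpu-cuvs 1.12 Python API (`GpuIndexCagra` / `index_gpu_to_cpu`); the full script is in the gist linked under the reproduction instructions below.

```python
import os
import numpy as np
import psutil
import faiss

docs, dims = 40_000_000, 128  # the raw fp32 data alone is ~19.5 GiB of host RAM

xb = np.random.rand(docs, dims).astype('float32')

res = faiss.StandardGpuResources()
config = faiss.GpuIndexCagraConfig()  # defaults; graph_degree affects graph size
gpu_index = faiss.GpuIndexCagra(res, dims, faiss.METRIC_L2, config)
gpu_index.train(xb)  # builds the CAGRA graph on the GPU

proc = psutil.Process(os.getpid())
rss_before = proc.memory_info().rss
cpu_index = faiss.index_gpu_to_cpu(gpu_index)  # CPU memory spikes here
rss_after = proc.memory_info().rss
print(f"RSS grew by {(rss_after - rss_before) / 2**20:.0f} MiB during conversion")
```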
My Setup
I used an EC2 g6.12xlarge machine with the Deep Learning Base OSS Nvidia Driver GPU AMI (Amazon Linux 2023) 20250912 AMI: https://docs.aws.amazon.com/dlami/latest/devguide/aws-deep-learning-base-gpu-ami-amazon-linux-2023.html
Output of `nvidia-smi`:
Mon Oct 13 03:30:57 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:38:00.0 Off | 0 |
| N/A 36C P8 16W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:3A:00.0 Off | 0 |
| N/A 31C P8 16W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:3C:00.0 Off | 0 |
| N/A 31C P8 16W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:3E:00.0 Off | 0 |
| N/A 27C P8 11W / 72W | 0MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Reproduction instructions
- On a server with GPUs and `conda` installed, set up the following `conda` environment:
conda create -n faiss_test_new -c conda-forge -c pytorch -c nvidia -c rapidsai python=3.12 faiss-gpu-cuvs=1.12.0 py3nvml pandas matplotlib psutil
- Activate the conda environment:
conda activate faiss_test_new
- Download the test script: https://gist.github.com/rchitale7/1e2a9c417139bfeeb902ba44f3e21a76
- Run the test script with the vector document and dimension counts. For example, for the 40 million x 128 dataset:
python faiss_test.py --docs 40000000 --dims 128
- This will generate a csv file with the CPU memory measurements over time, called `cpu_metrics_cagra_docs_40000000_dims_128.csv`. It will also generate a graph called `memory_correlation_docs_40000000_dims_128.png` that correlates the events during the index build process with the CPU memory at the relevant timestamps. In the script, I used the `cpu_used_process_memory` column of the csv file for the graph; the process CPU memory is measured using the `psutil` library (sketched below).
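For reference, the memory sampling in the script works along these lines. This is an illustrative sketch, not a verbatim excerpt from the gist; only the csv file and column names are taken from the output described above.

```python
import csv
import os
import threading
import time

import psutil


def sample_memory(path, stop_event, interval=0.5):
    # Periodically record this process's resident set size (RSS), in MiB,
    # to a csv file until stop_event is set.
    proc = psutil.Process(os.getpid())
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "cpu_used_process_memory"])
        while not stop_event.is_set():
            writer.writerow([time.time(), proc.memory_info().rss / 2**20])
            time.sleep(interval)


stop = threading.Event()
sampler = threading.Thread(
    target=sample_memory,
    args=("cpu_metrics_cagra_docs_40000000_dims_128.csv", stop),
)
sampler.start()
# ... build the GPU index and convert it to a CPU index here ...
stop.set()
sampler.join()
```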
I've attached the graphs I generated for the 40 million x 128 and 4 million x 1536 datasets to this issue, from running the script on an EC2 g6.12xlarge machine.
