RFC: top-k search with large k #4575

mdouze · 2025-09-05T15:13:46Z

mdouze
Sep 5, 2025
Collaborator

When searching with a large value of k (1000+), collecting the top-k results dominates the search time, ahead of exploring the index and computing distances.

Symptoms

For example, for an IVF200,Flat in 64D with 100k vectors and nprobe=10 we get something like (1 thread search):

k=1 time=0.580 s, -1s: 0.00 % ivf: 79842/51262281 
k=2 time=0.516 s, -1s: 0.00 % ivf: 149446/51262281 
k=5 time=0.498 s, -1s: 0.00 % ivf: 336793/51262281 
k=10 time=0.509 s, -1s: 0.00 % ivf: 616375/51262281 
k=20 time=0.547 s, -1s: 0.00 % ivf: 1111214/51262281 
k=50 time=0.688 s, -1s: 0.00 % ivf: 2373636/51262281 
k=100 time=0.859 s, -1s: 0.00 % ivf: 4137516/51262281 
k=200 time=1.185 s, -1s: 0.00 % ivf: 7095604/51262281 
k=500 time=2.019 s, -1s: 0.00 % ivf: 14069370/51262281 
k=1000 time=3.214 s, -1s: 0.00 % ivf: 23100713/51262281 
k=2000 time=5.239 s, -1s: 0.00 % ivf: 35884421/51262281 
k=5000 time=9.221 s, -1s: 2.87 % ivf: 50981769/51262281

In the ivf: n1/n2 column, n1 is the number of times the top-k structure was updated (each update costs O(log k)) and n2 is the number of distances computed.

I.e. returning 5k results is about 16x slower than returning 1 result (9.221 s vs 0.580 s). Admittedly, this is an extreme case since we basically ask to return 5% of the dataset. The -1s column gives the percentage of empty results (slots that were not filled in during search).
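
To make the n1/n2 counters concrete, here is a minimal Python sketch of a heap-based top-k collector that counts its own updates. It is not the Faiss code: `heap_topk` and the uniform random "distances" are assumptions made for illustration only.

```python
import heapq
import random


def heap_topk(distances, k):
    """Keep the k smallest distances with a max-heap and count real updates."""
    heap = []        # stores (-distance, id), so heap[0] is the current worst
    n_updates = 0
    for i, d in enumerate(distances):
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))       # O(log k)
            n_updates += 1
        elif d < -heap[0][0]:                   # beats the current k-th best
            heapq.heapreplace(heap, (-d, i))    # O(log k)
            n_updates += 1
        # otherwise the threshold rejects the distance in O(1)
    result = sorted((-nd, i) for nd, i in heap)
    return result, n_updates


# Synthetic stand-in for the distances computed during one query.
rng = random.Random(0)
dists = [rng.random() for _ in range(20000)]
for k in (1, 10, 100, 1000):
    _, n1 = heap_topk(dists, k)
    print(f"k={k} updates: {n1}/{len(dists)}")
```

The number of distances (n2) does not depend on k, but the number of updates (n1) grows with it, which is the pattern visible in the timings above.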

heap and reservoir

The approach used in Faiss is to collect the results in a reservoir rather than a heap.
The reservoir is a table of size 2k (or some other factor) that is filled unconditionally with results. When it is full, we compute the median distance (complexity O(k)) and drop all results above the median.
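This ought to be cheaper than the log(k) update cost of the heap: the O(k) shrink happens only after about k new insertions, so the amortized cost is O(1) per insertion.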
When the scanning is finished, the top-k results out of the 2k are returned.

Note that both the heap and the reservoir, once they contain at least k results, are "guarded" by a threshold that rejects any distance worse than the current k-th result.
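
The reservoir logic described above can be sketched in a few lines of Python. This is a simplified illustration, not ReservoirTopN itself: `reservoir_topk` and its `factor` parameter are hypothetical names, and the shrink step sorts instead of running the O(k) median selection, to keep the example short.

```python
import random


def reservoir_topk(distances, k, factor=2):
    """Collect the k smallest (distance, id) pairs with a bounded reservoir."""
    capacity = factor * k
    reservoir = []
    threshold = float("inf")    # distances >= threshold cannot improve the top-k
    for i, d in enumerate(distances):
        if d >= threshold:
            continue                      # rejected by the guard
        reservoir.append((d, i))          # unconditional O(1) insertion
        if len(reservoir) >= capacity:
            # Shrink back to k entries. A real implementation would use a
            # linear-time selection here; sorting is an assumption of this sketch.
            reservoir.sort()
            del reservoir[k:]
            threshold = reservoir[-1][0]  # current k-th best distance
    reservoir.sort()                      # final top-k out of at most factor*k
    return reservoir[:k]


rng = random.Random(1)
dists = [rng.random() for _ in range(5000)]
top = reservoir_topk(dists, k=50)
print(len(top), top[0][0] <= top[-1][0])
```

Most insertions are a plain append; the expensive step runs only when the table fills up, and the threshold gets tighter after each shrink.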

Current reservoir implementation

In fact, the reservoir is implemented only for the "flat" indexes, as ReservoirTopN.
It is enabled when k is larger than distance_compute_min_k_reservoir (a global variable set to 100 by default).

When searching on a flat index with the same config we get:
distance_compute_min_k_reservoir=100

k=1 time=2.046 s, -1s: 0.00 % RH: 0/0
k=2 time=2.387 s, -1s: 0.00 % RH: 511859/2228800000
k=5 time=2.496 s, -1s: 0.00 % RH: 1201961/2228800000
k=10 time=2.520 s, -1s: 0.00 % RH: 2273720/2228800000
k=20 time=2.495 s, -1s: 0.00 % RH: 4250091/2228800000
k=50 time=2.628 s, -1s: 0.00 % RH: 9607928/2228800000
k=100 time=4.497 s, -1s: 0.00 % RH: 12081284/1000000000
k=200 time=5.026 s, -1s: 0.00 % RH: 21400069/1000000000
k=500 time=6.438 s, -1s: 0.00 % RH: 46403235/1000000000
k=1000 time=8.792 s, -1s: 0.00 % RH: 81340627/1000000000
k=2000 time=12.708 s, -1s: 0.00 % RH: 140709665/1000000000
k=5000 time=23.014 s, -1s: 0.00 % RH: 279781445/1000000000

distance_compute_min_k_reservoir=10000

k=1 time=1.956 s, RH: 0/0
k=2 time=3.733 s, RH: 511859/2228800000
k=5 time=3.767 s, RH: 1201961/2228800000
k=10 time=3.764 s, RH: 2273720/2228800000
k=20 time=3.872 s, RH: 4250091/2228800000
k=50 time=3.934 s, RH: 9607928/2228800000
k=100 time=4.134 s, RH: 17666922/2228800000
k=200 time=4.689 s, RH: 32217226/2228800000
k=500 time=6.587 s, RH: 70361612/2228800000
k=1000 time=9.448 s, RH: 125103374/2228800000
k=2000 time=14.988 s, RH: 219148941/2228800000
k=5000 time=30.687 s, RH: 445288993/2228800000

RH: n1/n2 indicates the number of updates of the top-k structure (reservoir or heap, depending on the setting) / the total number of calls to the update function.

Observations:

The break-even point in speed between the two is closer to k=500 than to 100. Above that, the reservoir is indeed faster.
The gap between k=1 and k=5000 is much larger for the Flat index than for the IVFFlat index (about 20 s instead of 9 s). This can be explained by the number of top-k structure updates at k=5000: about 50M for IVF versus 280M for Flat. This is because (1) IVF scans a much smaller fraction of the dataset, so there are statistically fewer updates, and (2) it processes inverted lists from near to far, so the best vectors are processed first, followed by those that are less likely to be in the top-k.
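
The effect of the scan order in point (2) can be checked with a small Python simulation. It is an idealized sketch: a fully sorted scan is the extreme case of "near to far" (real inverted lists are only roughly ordered by centroid distance), and `count_heap_updates` is a hypothetical helper, not a Faiss function.

```python
import heapq
import random


def count_heap_updates(distances, k):
    """Number of times a size-k max-heap is modified while scanning."""
    heap, updates = [], 0
    for d in distances:
        if len(heap) < k:
            heapq.heappush(heap, -d)
            updates += 1
        elif d < -heap[0]:            # better than the current k-th best
            heapq.heapreplace(heap, -d)
            updates += 1
    return updates


rng = random.Random(0)
dists = [rng.random() for _ in range(100_000)]
k = 1000
random_order = count_heap_updates(dists, k)          # Flat-like scan order
near_to_far = count_heap_updates(sorted(dists), k)   # idealized IVF-like order
print(f"random order: {random_order} updates, near-to-far: {near_to_far} updates")
```

When the best candidates come first, the threshold is tight from the start and nearly every later distance is rejected without touching the heap.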

What can be improved?

We could implement the reservoir for IVF. Extrapolating from the Flat results above, it would be around 25% faster at k=5000 (23.0 s vs 30.7 s).
This requires some re-engineering. In particular, the InvertedListScanner would take a result collector instead of (or as an alternative to) the current labels and distances.
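
As a rough illustration of the re-engineering, here is a Python sketch where the scanning loop only talks to a collector object. All names (`ResultCollector`, `HeapCollector`, `scan_list`) are hypothetical and do not correspond to the existing Faiss API; the point is only that the scanner no longer needs to know which top-k structure is used.

```python
import heapq
from typing import List, Protocol, Tuple


class ResultCollector(Protocol):
    """Hypothetical interface: anything that can absorb (distance, label) pairs."""

    threshold: float

    def add(self, distance: float, label: int) -> None: ...
    def finalize(self) -> List[Tuple[float, int]]: ...


class HeapCollector:
    """One possible collector; a reservoir could implement the same interface."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.threshold = float("inf")
        self._heap: List[Tuple[float, int]] = []    # (-distance, label)

    def add(self, distance: float, label: int) -> None:
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-distance, label))
        elif distance < self.threshold:
            heapq.heapreplace(self._heap, (-distance, label))
        else:
            return
        if len(self._heap) == self.k:
            self.threshold = -self._heap[0][0]       # current k-th best

    def finalize(self) -> List[Tuple[float, int]]:
        return sorted((-nd, label) for nd, label in self._heap)


def scan_list(query, vectors, labels, collector: ResultCollector) -> None:
    """Stand-in for an inverted-list scan using squared L2 distances."""
    for vec, label in zip(vectors, labels):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        if dist < collector.threshold:
            collector.add(dist, label)


collector = HeapCollector(k=2)
scan_list([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0], [0.5, 0.0]], [10, 11, 12], collector)
print(collector.finalize())   # [(0.25, 12), (1.0, 10)]
```

With this split, switching IVF from a heap to a reservoir would mean passing a different collector rather than changing each scanner.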

A broader approach would be to consider that this is similar to filtered search: the top-k is a "filter" over all the computed distances.
Therefore, at search time for IVF (or graph indexes) we would collect all results. At the end, we'd run a partition sort to keep only the top-k results. This is an extreme case of the reservoir. Without implementing it, it is not clear how much faster this would be compared to the reservoir approach.
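
A minimal Python sketch of this "collect everything, partition once" variant is given below, using a quickselect-style partition. `partition_topk` is a hypothetical helper written for this example; a C++ implementation would more likely rely on something like std::nth_element, and the sketch says nothing about the actual speed in Faiss.

```python
import random


def partition_topk(results, k):
    """Return the k smallest (distance, id) pairs, unordered, via quickselect."""
    items = list(results)
    lo, hi = 0, len(items)        # the top-k boundary lies inside items[lo:hi]
    rng = random.Random(0)        # fixed seed so pivot choices are reproducible
    while lo < k < hi:
        pivot = items[rng.randrange(lo, hi)]
        window = items[lo:hi]
        smaller = [x for x in window if x < pivot]
        equal = [x for x in window if x == pivot]
        larger = [x for x in window if x > pivot]
        items[lo:hi] = smaller + equal + larger
        if k <= lo + len(smaller):
            hi = lo + len(smaller)                 # boundary is in the left part
        elif k <= lo + len(smaller) + len(equal):
            break                                  # boundary falls on the pivot
        else:
            lo += len(smaller) + len(equal)        # boundary is in the right part
    return items[:k]


# During search: append every computed (distance, id) without any filtering.
rng = random.Random(2)
all_results = [(rng.random(), i) for i in range(3000)]
top = partition_topk(all_results, 200)
print(len(top), max(top)[0])
```

The scan itself becomes a plain append with no threshold test, at the price of storing every computed distance until the final partition.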
