# Optimizing Binary Vector Search with USearch
The ultimate compressed vector representation is a vector of individual bits, as opposed to the more common 32-bit `f32` floats and 8-bit `i8` integers.
That representation is natively supported by [USearch](https://github.com/unum-cloud/usearch), but given the tiny size of the vectors, more optimizations can be explored to scale to larger datasets.
Luckily, modern embedding models are often trained in a quantization-aware manner, and precomputed Wikipedia embeddings are available on the HuggingFace portal.

Native optimizations are implemented with the [NumBa](https://numba.pydata.org/) and [Cppyy](https://cppyy.readthedocs.io/) JIT compilers for Python and C++, and are packed into [UV](https://docs.astral.sh/uv/)-compatible scripts.

## Running Examples
To view the results, check out the [`bench.ipynb`](bench.ipynb).

In both cases, the embeddings have 1024 dimensions, each represented with a single bit.

## Optimizations
Knowing the length of the embeddings in advance is very handy for optimizations.
If the embeddings are only 1024 bits long, we only need 2 `ZMM` CPU registers on x86 machines to store the entire vector.
We don't need any `for`-loops; the entire operation can be unrolled and inlined.
As shown in [less_slow.cpp](https://github.com/ashvardanian/less_slow.cpp), decomposing `for`-loops (which are equivalent to `if`-statements and jumps) into unrolled kernels is a universally great idea.
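
As a minimal sketch, assuming a CPU with the AVX-512 `VPOPCNTQ` extension, such an unrolled 1024-bit Jaccard kernel could look like the following; the `jaccard_distance_1024d` name and the exact intrinsic sequence are illustrative, not this repository's tuned kernel:

```cpp
#include <cstdint>
#include <immintrin.h> // AVX-512 intrinsics, compile with `-mavx512f -mavx512vpopcntdq`

// 1024 bits fit in 2x ZMM registers per operand: no loops, no branches,
// and 4x `_mm512_popcnt_epi64` (VPOPCNTQ) invocations in total.
float jaccard_distance_1024d(std::uint64_t const* a, std::uint64_t const* b) {
    __m512i a0 = _mm512_loadu_si512(a), a1 = _mm512_loadu_si512(a + 8);
    __m512i b0 = _mm512_loadu_si512(b), b1 = _mm512_loadu_si512(b + 8);
    __m512i intersection = _mm512_add_epi64( //
        _mm512_popcnt_epi64(_mm512_and_si512(a0, b0)),
        _mm512_popcnt_epi64(_mm512_and_si512(a1, b1)));
    __m512i union_ = _mm512_add_epi64( //
        _mm512_popcnt_epi64(_mm512_or_si512(a0, b0)),
        _mm512_popcnt_epi64(_mm512_or_si512(a1, b1)));
    std::uint64_t i = _mm512_reduce_add_epi64(intersection);
    std::uint64_t u = _mm512_reduce_add_epi64(union_);
    return u ? 1.f - float(i) / float(u) : 0.f; // Two zero vectors are identical
}
```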
But the problem with the kernel above is that `_mm512_popcnt_epi64` is an expensive instruction, and most Intel CPUs can only execute it on a single CPU port.
There are several ways to implement population counts on SIMD-capable x86 CPUs.
Focusing on AVX-512, we can either use the dedicated `VPOPCNTQ` instruction, or use `VPSHUFB` to shuffle bits and then `VPSADBW` to sum them up. The dedicated `VPOPCNTQ` instruction:

- On Ice Lake: has 3 cycles of latency and executes only on port 5.
- On Zen4: has 3 cycles of latency and executes on both ports 0 and 1.
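
The second option builds a population count from a per-nibble lookup table, in the style of Wojciech Muła's well-known algorithm. Here is a minimal sketch, with the `popcounts_zmm` name being illustrative:

```cpp
#include <immintrin.h> // AVX-512 intrinsics, compile with `-mavx512bw`

// VPSHUFB looks up the bit count of every nibble in a 16-byte table,
// and VPSADBW against zero sums the resulting bytes into 8x 64-bit lanes.
__m512i popcounts_zmm(__m512i x) {
    __m512i lut = _mm512_broadcast_i32x4(
        _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4));
    __m512i low = _mm512_and_si512(x, _mm512_set1_epi8(0x0F));
    __m512i high = _mm512_and_si512(_mm512_srli_epi16(x, 4), _mm512_set1_epi8(0x0F));
    __m512i counts = _mm512_add_epi8( //
        _mm512_shuffle_epi8(lut, low), _mm512_shuffle_epi8(lut, high));
    return _mm512_sad_epu8(counts, _mm512_setzero_si512());
}
```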
So when implementing the Jaccard distance, the most important kernel for binary similarity indices, with `VPOPCNTQ` we will have 4 such instruction invocations, as in the kernel above, all stalling on the same port 5.

A naive Jaccard distance implementation in C++20, using STL's [`std::popcount`](https://en.cppreference.com/w/cpp/numeric/popcount), would look like this:

```cpp
#include <bit>     // `std::popcount`
#include <cstdint> // `std::uint64_t`

float jaccard_distance_192d(std::uint64_t const* a, std::uint64_t const* b) {
    // 192 bits = 3x 64-bit words per vector
    std::uint64_t intersection = 0, union_ = 0;
    for (int i = 0; i != 3; ++i) {
        intersection += std::popcount(a[i] & b[i]);
        union_ += std::popcount(a[i] | b[i]);
    }
    return union_ ? 1.f - float(intersection) / float(union_) : 0.f;
}
```

For any three words, $\text{popcount}(a) + \text{popcount}(b) + \text{popcount}(c) = 2 \cdot \text{popcount}(\text{major}(a, b, c)) + \text{popcount}(a \oplus b \oplus c)$, where $\text{major}(a, b, c) = (a \wedge b) \vee (a \wedge c) \vee (b \wedge c)$ is the bitwise majority and the XOR keeps the bits with an odd sum. That rule is an example of Carry-Save Adders (CSAs) and can be used for longer sequences as well.
So $N$ "Odd-Major" population counts can cover the logic needed for the $2^N-1$ original counts:

- 7x `std::popcount` can be reduced to 3x, as sketched after this list.
- 15x `std::popcount` can be reduced to 4x.
- 31x `std::popcount` can be reduced to 5x.
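
As a minimal sketch of the first of those reductions, here is how 7x 64-bit words can be counted with only 3x `std::popcount` calls; the `csa` and `popcount_7x64` helpers are illustrative names:

```cpp
#include <bit>     // `std::popcount`
#include <cstdint> // `std::uint64_t`

// One carry-save adder step: per bit position, a + b + c == 2 * high + low.
inline void csa(std::uint64_t a, std::uint64_t b, std::uint64_t c, //
                std::uint64_t& high, std::uint64_t& low) {
    high = (a & b) | (a & c) | (b & c); // Bitwise majority
    low = a ^ b ^ c;                    // Bits with an odd sum
}

// Counts the set bits across 7x words with just 3x `std::popcount` calls.
std::uint64_t popcount_7x64(std::uint64_t const* w) {
    std::uint64_t h0, l0, h1, l1, h2, l2, h3, l3;
    csa(w[0], w[1], w[2], h0, l0);
    csa(w[3], w[4], w[5], h1, l1);
    csa(l0, l1, w[6], h2, l2); // Fold the weight-1 terms
    csa(h0, h1, h2, h3, l3);   // Fold the weight-2 terms
    return 4ull * std::popcount(h3) + 2ull * std::popcount(l3) + std::popcount(l2);
}
```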

When dealing with 1024-dimensional bit-vectors on 64-bit machines, we can view them as 16x 64-bit words, leveraging the "Odd-Majority" trick to fold the first 15x `std::popcount` calls into 4x, and then issuing 1x more call for the tail, thus shrinking from 16x to 5x calls at the cost of several extra `XOR` and `AND` operations.

### Testing and Profiling Kernels
To run the kernel benchmarks, use the following command: