Commit e01f753 ("Docs: Odd-Majority & Harley-Seal")
ashvardanian/SimSIMD#138
Parent: 8eb4d36

File tree: 2 files changed (+107 −5 lines)

.vscode/settings.json (12 additions, 0 deletions)

```json
{
  "cSpell.words": [
    "binarize",
    "cppyy",
    "Jaccard",
    "mixedbread",
    "packbits",
    "popcount",
    "tanimoto",
    "usearch"
  ]
}
```

README.md (95 additions, 5 deletions)

# Optimizing Binary Vector Search with USearch

The ultimate compressed vector representation is a vector of individual bits, as opposed to the more common 32-bit `f32` floats and 8-bit `i8` integers.
That representation is natively supported by [USearch](https://github.com/unum-cloud/usearch), but given the tiny size of the vectors, more optimizations can be explored to scale to larger datasets.
Luckily, modern embedding models are often trained in a quantization-aware manner, and precomputed Wikipedia embeddings are available on the HuggingFace portal:

- [Co:here](https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3)
- [MixedBread.ai](https://huggingface.co/datasets/mixedbread-ai/wikipedia-embed-en-2023-11)

Native optimizations are implemented with the [NumBa](https://numba.pydata.org/) and [Cppyy](https://cppyy.readthedocs.io/) JIT compilers for Python and C++, and are packaged into [UV](https://docs.astral.sh/uv/)-compatible scripts.

## Running Examples

To view the results, check out the [`bench.ipynb`](bench.ipynb).
[…]

In both cases, the embeddings have 1024 dimensions, each represented with a single bit.

## Optimizations

Knowing the length of embeddings is very handy for optimizations.
If the embeddings are only 1024 bits long, we only need 2 `ZMM` CPU registers on x86 machines to store the entire vector.
We don't need any `for`-loops; the entire operation can be unrolled and inlined.

```c
uint64_t hamming_distance_1024d(uint8_t const* first_vector, uint8_t const* second_vector) {
    __m512i const first_start = _mm512_loadu_si512((__m512i const*)(first_vector));
    __m512i const first_end = _mm512_loadu_si512((__m512i const*)(first_vector + 64));
    __m512i const second_start = _mm512_loadu_si512((__m512i const*)(second_vector));
    __m512i const second_end = _mm512_loadu_si512((__m512i const*)(second_vector + 64));
    __m512i const differences_start = _mm512_xor_epi64(first_start, second_start);
    __m512i const differences_end = _mm512_xor_epi64(first_end, second_end);
    __m512i const population_start = _mm512_popcnt_epi64(differences_start);
    __m512i const population_end = _mm512_popcnt_epi64(differences_end);
    return _mm512_reduce_add_epi64(_mm512_add_epi64(population_start, population_end));
}
```

As shown in [less_slow.cpp](https://github.com/ashvardanian/less_slow.cpp), decomposing `for`-loops (which are equivalent to `if`-statements and jumps) into unrolled kernels is a universally great idea.
But the problem with the kernel above is that `_mm512_popcnt_epi64` is an expensive instruction, and most Intel CPUs can only execute it on a single CPU port.
There are several ways to implement population counts on SIMD-capable x86 CPUs.
Focusing on AVX-512, we can either use the `VPOPCNTQ` instruction, or use `VPSHUFB` as an in-register lookup table of per-nibble counts and then `VPSADBW` to sum the resulting bytes:
- [VPOPCNTQ (ZMM, ZMM)](https://uops.info/html-instr/VPOPCNTQ_ZMM_ZMM.html):
  - On Ice Lake: 3 cycles latency, executes only on port 5.
  - On Zen4: 2 cycles latency, executes on ports 0 and 1.
- [VPSHUFB (ZMM, ZMM, ZMM)](https://uops.info/html-instr/VPSHUFB_ZMM_ZMM_ZMM.html):
  - On Skylake-X: 1 cycle latency, executes only on port 5.
  - On Ice Lake: 1 cycle latency, executes only on port 5.
  - On Zen4: 2 cycles latency, executes on ports 1 and 2.
- [VPSADBW (ZMM, ZMM, ZMM)](https://uops.info/html-instr/VPSADBW_ZMM_ZMM_ZMM.html):
  - On Ice Lake: 3 cycles latency, executes only on port 5.
  - On Zen4: 3 cycles latency, executes on ports 0 and 1.

So when implementing the Jaccard distance, the most important kernel for binary similarity indices, with `VPOPCNTQ`, we end up with 4 such instruction invocations, all contending for the same port 5:

```c
float jaccard_distance_1024d(uint8_t const* first_vector, uint8_t const* second_vector) {
    __m512i const first_start = _mm512_loadu_si512((__m512i const*)(first_vector));
    __m512i const first_end = _mm512_loadu_si512((__m512i const*)(first_vector + 64));
    __m512i const second_start = _mm512_loadu_si512((__m512i const*)(second_vector));
    __m512i const second_end = _mm512_loadu_si512((__m512i const*)(second_vector + 64));
    __m512i const intersection_start = _mm512_and_epi64(first_start, second_start);
    __m512i const intersection_end = _mm512_and_epi64(first_end, second_end);
    __m512i const union_start = _mm512_or_epi64(first_start, second_start);
    __m512i const union_end = _mm512_or_epi64(first_end, second_end);
    __m512i const population_intersection_start = _mm512_popcnt_epi64(intersection_start);
    __m512i const population_intersection_end = _mm512_popcnt_epi64(intersection_end);
    __m512i const population_union_start = _mm512_popcnt_epi64(union_start);
    __m512i const population_union_end = _mm512_popcnt_epi64(union_end);
    __m512i const population_intersection = _mm512_add_epi64(population_intersection_start, population_intersection_end);
    __m512i const population_union = _mm512_add_epi64(population_union_start, population_union_end);
    return 1.f - _mm512_reduce_add_epi64(population_intersection) * 1.f / _mm512_reduce_add_epi64(population_union);
}
```

That port contention is a known bottleneck and can be improved.

### Harley-Seal and Odd-Majority Algorithms

For simplicity, let's take a look at a few 192-dimensional bit-vectors.
Each will be represented with 3x 64-bit unsigned integers.

```cpp
std::uint64_t a[3] = {0x0000000000000000, 0x0000000000000000, 0x0000000000000000};
std::uint64_t b[3] = {0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF};
std::uint64_t c[3] = {0x0000000000000000, 0xFFFFFFFFFFFFFFFF, 0x0000000000000000};
```

A naive Jaccard distance implementation in C++20 using STL's [`std::popcount`](https://en.cppreference.com/w/cpp/numeric/popcount) would look like this:

```cpp
#include <bit>

float jaccard_distance_192d(std::uint64_t const* a, std::uint64_t const* b) {
    int intersection = std::popcount(a[0] & b[0]) + std::popcount(a[1] & b[1]) + std::popcount(a[2] & b[2]);
    int union_ = std::popcount(a[0] | b[0]) + std::popcount(a[1] | b[1]) + std::popcount(a[2] | b[2]);
    return 1.f - intersection * 1.f / union_;
}
```

That's 6x `std::popcount` calls, and we can easily reduce them to 4x by applying the following rule to both `intersection` and `union_`:

$$
\begin{aligned}
X &:= \{x_0,\,x_1,\,x_2\} \\
\operatorname{PopCount}(X) &:= \sum_{i=0}^{2}\operatorname{PopCount}(x_i) \\
\text{Odd} &:= (x_0 \oplus x_1)\,\oplus\, x_2 \\
\text{Major} &:= \bigl((x_0 \oplus x_1)\land x_2\bigr)\,\lor\, (x_0 \land x_1) \\
\curvearrowright \operatorname{PopCount}(X) &:= 2\,\operatorname{PopCount}(\text{Major}) + \operatorname{PopCount}(\text{Odd})
\end{aligned}
$$

That rule is an example of a Carry-Save Adder (CSA) and can be applied to longer sequences as well.
In general, $N$ "Odd-Major" population counts can cover the logic needed for $2^N-1$ original counts:

- 7x `std::popcount` can be reduced to 3x.
- 15x `std::popcount` can be reduced to 4x.
- 31x `std::popcount` can be reduced to 5x.

When dealing with 1024-dimensional bit-vectors on 64-bit machines, we can view them as 16x 64-bit words, leveraging the "Odd-Majority" trick to fold the first 15x `std::popcount` calls into 4x, and then issuing 1x more call for the tail, thus shrinking from 16x to 5x calls at the cost of several `XOR` and `AND` operations.

### Testing and Profiling Kernels

To run the kernel benchmarks, use the following command:

```sh
