Using HNSW index in complex queries

I am experimenting with this extension to implement clustering based on the embeddings generated by a LLM for short text strings.

Consider the following table (at the moment it is filled with around 100K rows, but I'll need to scale up to 100M rows).

```sql
CREATE TABLE "items" (
    id BLOB,
    my_text VARCHAR,
    embedding FLOAT32[3072]
);
CREATE INDEX items_embedding_index ON 'items' USING HNSW (embedding) WITH (metric = 'cosine')
```

My goal is to merge different items into clusters and save them in a table like the following where items sharing the same `cluster_id` belongs to the same cluster.

```sql
CREATE TABLE "clusters" (
    cluster_id BLOB,
    item_id BLOB
);
```

In my use case clusters correspond to the connected components of the graph having the items as vertices and edges that joins two vertices whenever the cosine distance of their embeddings is less than a given threshold (e.g. 0.15).
Once the graph is built, computing the connected components requires almost linear time using the [Union-Find algorithm](https://en.wikipedia.org/wiki/Disjoint-set_data_structure).

I am having troubles to build a query able to compute the edges of the graph fast enough.
The kind of query I need to run is as follows:

```sql
SELECT i.*, j.*
FROM items AS i
INNER JOIN items AS j ON array_cosine_distance(i.embedding, j.embedding) < 0.15 AND a.id < b.id
```

However, the query does no use the HSNW index, resulting in two sequential scans that lead to huge execution times:

```text
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│           my_text         │
│         embedding         │
│             id            │
│           my_text         │
│         embedding         │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│   (array_cosine_distance  │
│ (embedding, embedding) < 0│
│            .15)           │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    PIECEWISE_MERGE_JOIN   │
│    ────────────────────   │
│      Join Type: INNER     │
│    Conditions: id < id    ├──────────────┐
│                           │              │
│        ~247067 Rows       │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│           items           ││           items           │
│                           ││                           │
│        Projections:       ││        Projections:       │
│         embedding         ││         embedding         │
│             id            ││             id            │
│          my_text          ││          my_text          │
│                           ││                           │
│        ~98827 Rows        ││        ~98827 Rows        │
└───────────────────────────┘└───────────────────────────┘
```

As far as I understand, this case is not yet handled by the vss extension.
Here are my questions:

  1. In theory, could this kind of query benefit from the existance of a HNSW index?
  2. If so, do you plan to extend the implementation to cover this use case?

Thanks a lot!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Using HNSW index in complex queries #43

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Using HNSW index in complex queries #43

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions