Skip to content

Using HNSW index in complex queries #43

@lmores

Description

@lmores

I am experimenting with this extension to implement clustering based on the embeddings generated by a LLM for short text strings.

Consider the following table (at the moment it is filled with around 100K rows, but I'll need to scale up to 100M rows).

CREATE TABLE "items" (
    id BLOB,
    my_text VARCHAR,
    embedding FLOAT32[3072]
);
CREATE INDEX items_embedding_index ON 'items' USING HNSW (embedding) WITH (metric = 'cosine')

My goal is to merge different items into clusters and save them in a table like the following where items sharing the same cluster_id belongs to the same cluster.

CREATE TABLE "clusters" (
    cluster_id BLOB,
    item_id BLOB
);

In my use case clusters correspond to the connected components of the graph having the items as vertices and edges that joins two vertices whenever the cosine distance of their embeddings is less than a given threshold (e.g. 0.15).
Once the graph is built, computing the connected components requires almost linear time using the Union-Find algorithm.

I am having troubles to build a query able to compute the edges of the graph fast enough.
The kind of query I need to run is as follows:

SELECT i.*, j.*
FROM items AS i
INNER JOIN items AS j ON array_cosine_distance(i.embedding, j.embedding) < 0.15 AND a.id < b.id

However, the query does no use the HSNW index, resulting in two sequential scans that lead to huge execution times:

┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             id            │
│           my_text         │
│         embedding         │
│             id            │
│           my_text         │
│         embedding         │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│   (array_cosine_distance  │
│ (embedding, embedding) < 0│
│            .15)           │
│                           │
│        ~247067 Rows       │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    PIECEWISE_MERGE_JOIN   │
│    ────────────────────   │
│      Join Type: INNER     │
│    Conditions: id < id    ├──────────────┐
│                           │              │
│        ~247067 Rows       │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│           items           ││           items           │
│                           ││                           │
│        Projections:       ││        Projections:       │
│         embedding         ││         embedding         │
│             id            ││             id            │
│          my_text          ││          my_text          │
│                           ││                           │
│        ~98827 Rows        ││        ~98827 Rows        │
└───────────────────────────┘└───────────────────────────┘

As far as I understand, this case is not yet handled by the vss extension.
Here are my questions:

  1. In theory, could this kind of query benefit from the existance of a HNSW index?
  2. If so, do you plan to extend the implementation to cover this use case?

Thanks a lot!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions