Skip to content

Composite indexes and filtered searches #121

@DylanGriffith

Description

@DylanGriffith

Problem

I was interested to understand how this extension performs when combining other filters (WHERE clauses) with vector searches. This is a common issue with vector search indexes and there is a long discussion about it in pgvector at pgvector/pgvector#259 .

I couldn't find an existing issue or mention of the topic in this repo except for this question #77 (comment) so I thought I'd open an issue.

Basically the idea would be to support queries like:

SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5 

Typically in Postgres such queries are optimized with a composite index on user_id and embedding. Composite indexes are simple for a BTREE index but much more complicated for other index types.

Without a composite index Postgres query planner has to choose between loading all posts WHERE user_id = 123 and sorting those in memory and then limiting to the top 5 or it has to look at the entire ordered embedding <-> '[3,1,2]' list and keep searching until it finds 5 matches where user_id = 123. The choice usually depends on statistics which indicate which will be least wasteful. But in many applications both options will be very wasteful.

The worst case scenario happens when there are many users and every user has many posts. In that case the posts WHERE user_id = 123 needs to sort a lot of posts in memory and the embedding <-> '[3,1,2]' option would need to skip many irrelevant posts by different users.

Solution

This issue is mainly opened as a starting point to the conversation. There aren't easy answers here but usually only tradeoffs and I wanted to hear from the maintainers about their thoughts on this problem.

Related research

So far the most thorough discussion I've seen about this topic is in https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf but this has not yielded any production grade database nor a Postgres extension to my knowledge.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions