-
Notifications
You must be signed in to change notification settings - Fork 38
Description
Problem
I was interested to understand how this extension performs when combining other filters (WHERE
clauses) with vector searches. This is a common issue with vector search indexes and there is a long discussion about it in pgvector at pgvector/pgvector#259 .
I couldn't find an existing issue or mention of the topic in this repo except for this question #77 (comment) so I thought I'd open an issue.
Basically the idea would be to support queries like:
SELECT * FROM posts WHERE user_id = 123 ORDER BY embedding <-> '[3,1,2]' LIMIT 5
Typically in Postgres such queries are optimized with a composite index on user_id
and embedding
. Composite indexes are simple for a BTREE
index but much more complicated for other index types.
Without a composite index Postgres query planner has to choose between loading all posts WHERE user_id = 123
and sorting those in memory and then limiting to the top 5 or it has to look at the entire ordered embedding <-> '[3,1,2]'
list and keep searching until it finds 5 matches where user_id = 123
. The choice usually depends on statistics which indicate which will be least wasteful. But in many applications both options will be very wasteful.
The worst case scenario happens when there are many users and every user has many posts. In that case the posts WHERE user_id = 123
needs to sort a lot of posts in memory and the embedding <-> '[3,1,2]'
option would need to skip many irrelevant posts by different users.
Solution
This issue is mainly opened as a starting point to the conversation. There aren't easy answers here but usually only tradeoffs and I wanted to hear from the maintainers about their thoughts on this problem.
Related research
So far the most thorough discussion I've seen about this topic is in https://harsha-simhadri.org/pubs/Filtered-DiskANN23.pdf but this has not yielded any production grade database nor a Postgres extension to my knowledge.