[Feature Request]: Change default behavior of ranker from intersection to union #5852

@kylediaz

Describe the problem

r1 = Knn(query=query)
r2 = Knn(query=query)
search = Search(
    rank=r1 + r2,
)

Each Knn operator finds its own set of K documents, so the two operators can produce two different sets. Depending on the value of default in each Knn operator, the executor evaluating r1 + r2 returns either the union or the intersection of these sets. By default, default=None, which makes the executor return the intersection.
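To make the two modes concrete, here is a minimal pure-Python sketch that does not use Chroma at all. `KnnResult` and `add` are hypothetical stand-ins for a Knn operator's output and for how the executor might evaluate r1 + r2; the scores and the default of 1.0 are made up, and the real executor may work differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class KnnResult:
    """Hypothetical model of one Knn operator's top-K output."""
    scores: Dict[str, float]          # document id -> score
    default: Optional[float] = None   # score for documents this operator did not return

    def value(self, doc_id: str) -> Optional[float]:
        return self.scores.get(doc_id, self.default)


def add(r1: KnnResult, r2: KnnResult) -> Dict[str, float]:
    """Model of `r1 + r2`: a document with no value on either side is dropped."""
    combined = {}
    for doc_id in r1.scores.keys() | r2.scores.keys():
        v1, v2 = r1.value(doc_id), r2.value(doc_id)
        if v1 is None or v2 is None:
            continue  # default=None excludes the document, giving an intersection
        combined[doc_id] = v1 + v2
    return combined


top_a = {"doc1": 0.1, "doc2": 0.2, "doc3": 0.3}
top_b = {"doc3": 0.1, "doc4": 0.2, "doc5": 0.3}

# default=None on both sides: only documents found by both operators survive.
intersection = add(KnnResult(top_a), KnnResult(top_b))
# A numeric default on both sides: every document found by either operator survives.
union = add(KnnResult(top_a, default=1.0), KnnResult(top_b, default=1.0))

print(sorted(intersection))  # ['doc3']
print(sorted(union))         # ['doc1', 'doc2', 'doc3', 'doc4', 'doc5']
```

In this model the same two result sets yield one document or five, depending only on default.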

I propose that we make union the default behavior.

It's easy to stumble into this behavior

Suppose someone starts with a plain Knn search and later adds another ranker to the expression:

search = Search(
    # rank=Knn(query=query),
    rank=Knn(query=query) + .3 * Knn(query=query, key="sparse_embedding"),
)

All of a sudden, the number of documents returned in the search keeps changing! Is there something wrong with my sparse vectors? It can be very confusing!

Imagine another scenario where I wanted to implement query expansion. This is what would happen to me:

queries = llm_expand_query(query)
search = Search(
    rank=sum(Knn(query=q) for q in queries),
)

All of a sudden, the search API is returning 0 documents!

Other

  • Intersection assumes that precision matters far more than recall. It is good if you only want to show documents deemed relevant by multiple sources. But Chroma is designed primarily for AI applications: LLMs are generally pretty lenient about precision nowadays, while recall is very important. With intersection, there is a large risk that relevant documents are left out of the result set.
  • Intersection is only theoretically good for hybrid search. In practice, I've found very little overlap between dense and hybrid search. It also does not make sense outside hybrid search: for example, you would not expect overlap if you used it for query expansion.
  • I suspected that commutativity and associativity might be broken by this. In my tests they do not seem to be, but will our users worry about the order in which these rankers are combined?
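The ordering concern in the last bullet can be checked against a toy model. The sketch below is not Chroma's executor: `Leaf`, `Add`, and `evaluate` are hypothetical, and the model assumes that each leaf substitutes its own default for a missing document and that None propagates upward and drops the document. Scores are chosen so that the float sums are exact.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Union


@dataclass
class Leaf:
    """Hypothetical Knn leaf: its top-K scores plus its own default."""
    scores: Dict[str, float]
    default: Optional[float] = None


@dataclass
class Add:
    """Hypothetical `left + right` node in a rank expression."""
    left: "Expr"
    right: "Expr"


Expr = Union[Leaf, Add]


def candidates(expr: Expr) -> Set[str]:
    """Every document returned by at least one leaf."""
    if isinstance(expr, Leaf):
        return set(expr.scores)
    return candidates(expr.left) | candidates(expr.right)


def score(expr: Expr, doc_id: str) -> Optional[float]:
    """Score one document; None means 'excluded' and propagates upward."""
    if isinstance(expr, Leaf):
        return expr.scores.get(doc_id, expr.default)
    left, right = score(expr.left, doc_id), score(expr.right, doc_id)
    return None if left is None or right is None else left + right


def evaluate(expr: Expr) -> Dict[str, float]:
    scored = {d: score(expr, d) for d in candidates(expr)}
    return {d: s for d, s in scored.items() if s is not None}


def order_independent(default: Optional[float]) -> bool:
    a = Leaf({"doc1": 1.0, "doc2": 2.0}, default)
    b = Leaf({"doc2": 4.0, "doc3": 8.0}, default)
    c = Leaf({"doc2": 16.0, "doc4": 32.0}, default)
    commutative = evaluate(Add(a, b)) == evaluate(Add(b, a))
    associative = evaluate(Add(Add(a, b), c)) == evaluate(Add(a, Add(b, c)))
    return commutative and associative


intersection_ok = order_independent(None)   # default=None on every leaf
union_ok = order_independent(64.0)          # numeric default on every leaf
print(intersection_ok, union_ok)  # True True
```

Under these assumptions the result does not depend on how the rankers are grouped or ordered in either mode. The property relies on the default being applied per leaf; a model that applied one default at each + node would count it twice for some groupings.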

Describe the proposed solution

  1. I wish we could simply change the default value of Knn(default=None) to the "lowest value", but that value depends on context: it should be f32.MAX when use_rank=True and f32.MIN when use_rank=False.
  2. This is important enough to me that I would change the behavior silently: disable "intersection" mode and allow only union mode. When default=None, the executor would treat None as the "lowest" value for the context.
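The rule in item 2 could look roughly like the sketch below. `resolve_default` is a hypothetical helper, not an existing Chroma function; the parameter name use_rank follows the wording above, and f32.MIN is assumed to mean the most negative finite 32-bit float (as in Rust's f32::MIN).

```python
from typing import Optional

# Largest finite 32-bit float, (2 - 2**-23) * 2**127, computed exactly in float64.
F32_MAX = (2 - 2 ** -23) * 2.0 ** 127
# Assumption: "f32.MIN" is the most negative finite value, not the smallest positive one.
F32_MIN = -F32_MAX


def resolve_default(default: Optional[float], use_rank: bool) -> float:
    """Hypothetical: replace default=None with the context-dependent sentinel."""
    if default is not None:
        return default  # an explicit default always wins
    # Sentinel values as stated in the proposal above.
    return F32_MAX if use_rank else F32_MIN


rank_default = resolve_default(None, use_rank=True)
score_default = resolve_default(None, use_rank=False)
explicit_default = resolve_default(0.5, use_rank=True)
```

With a rule like this, a document missing from one ranker would stay in the result set instead of being dropped.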

Alternatives considered

No response

Importance

would make my life easier

Additional Information

No response

Metadata

Labels

enhancement (New feature or request)
