Support semantic reranking using contextual snippets instead of entire field text #129369

Open
wants to merge 58 commits into
base: main

kderusso
Member

@kderusso kderusso commented Jun 12, 2025

Followup to the POC described in #128255

Adds the ability to rerank on a small number of snippets taken from the field, rather than on the entire field text.

Example, with a default of one snippet:

GET test/_search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "term": {
              "other": "lotr"
            }
          }
        }
      },
      "rank_window_size": 2,
      "field": "semantic_text",
      "inference_text": "are all who wander lost?",
      "snippets": { }
    }
  }
}

Example, specifying snippets:

GET test/_search
{
  "retriever": {
    "text_similarity_reranker": {
      "retriever": {
        "standard": {
          "query": {
            "term": {
              "other": "lotr"
            }
          }
        }
      },
      "rank_window_size": 2,
      "field": "semantic_text",
      "inference_text": "are all who wander lost?",
      "snippets": {
        "num_snippets": 3
      }
    }
  }
}

If snippets is not specified, the entire field contents will continue to be sent to the reranker model.
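The three request shapes described above (no `snippets`, an empty `snippets` object, and an explicit `num_snippets`) can be summarized in a minimal sketch. This is plain Java written for illustration only: `RerankInputSketch`, `rerankInputs`, and the `snippetsSpecified` flag are hypothetical names, not classes or methods from this PR, and the list of snippets is assumed to be already ordered best-first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RerankInputSketch {
    // The description above states that the default is one snippet.
    static final int DEFAULT_NUM_SNIPPETS = 1;

    /**
     * Hypothetical helper: picks the text sent to the reranker for one document.
     * snippetsSpecified == false models a request without "snippets".
     * numSnippets == null models "snippets": { } (use the default).
     */
    static List<String> rerankInputs(String fieldText, List<String> bestSnippets,
                                     boolean snippetsSpecified, Integer numSnippets) {
        if (!snippetsSpecified) {
            // No snippet config: keep sending the entire field contents.
            return Collections.singletonList(fieldText);
        }
        int wanted = numSnippets != null ? numSnippets : DEFAULT_NUM_SNIPPETS;
        int n = Math.min(wanted, bestSnippets.size());
        // Copy so callers cannot mutate the source list through the view.
        return new ArrayList<>(bestSnippets.subList(0, n));
    }

    public static void main(String[] args) {
        List<String> snippets = Arrays.asList("s1", "s2", "s3", "s4");
        System.out.println(rerankInputs("full text", snippets, false, null));
        System.out.println(rerankInputs("full text", snippets, true, null));
        System.out.println(rerankInputs("full text", snippets, true, 3));
    }
}
```

How snippets are actually extracted and ranked is outside this sketch; it only shows which inputs reach the model in each case.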

@kderusso changed the title from "Rerank snippet POC" to "Support semantic reranking using contextual snippets instead of entire field text" on Jul 2, 2025
Contributor

@jimczi jimczi left a comment

I did a first pass and left some comments. I think we can better isolate this change in the text similarity rank builder.

❌ Author of the following commits did not sign a Contributor Agreement:
9ae38c9, 8a82b13, 7c6848d, b6aaf8f, 2418e41

Please, read and sign the above mentioned agreement if you want to contribute to this project

@kderusso
Member Author

cla/check

int fragmentSize = tokenSizeLimit * TOKEN_SIZE_LIMIT_MULTIPLIER;
highlightBuilder.fragmentSize(fragmentSize);
SearchHighlightContext searchHighlightContext = highlightBuilder.build(context.getSearchExecutionContext());
context.highlight(searchHighlightContext);
Contributor

Should we also set noMatchSize to ensure that we always get a snippet for every document?
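For context, the highlighter's no-match size is the amount of text returned from the beginning of the field when no fragment matches, and it defaults to 0 (nothing returned). The effect being asked about can be sketched as below. This is a plain Java illustration under that assumption, not the Elasticsearch highlighter; `NoMatchFallbackSketch` and `snippetsOrFallback` are hypothetical names.

```java
import java.util.Collections;
import java.util.List;

public class NoMatchFallbackSketch {
    /**
     * Hypothetical helper: returns the matched snippets, or, when nothing
     * matched, the leading noMatchSize characters of the field so that the
     * document still has text to rerank.
     */
    static List<String> snippetsOrFallback(List<String> matched, String fieldText, int noMatchSize) {
        if (!matched.isEmpty()) {
            return matched;
        }
        if (noMatchSize <= 0 || fieldText.isEmpty()) {
            // With a no-match size of 0 the document ends up with no snippet at all.
            return Collections.emptyList();
        }
        int end = Math.min(noMatchSize, fieldText.length());
        return Collections.singletonList(fieldText.substring(0, end));
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(snippetsOrFallback(none, "Not all those who wander are lost", 0));
        System.out.println(snippetsOrFallback(none, "Not all those who wander are lost", 7));
    }
}
```

Setting the no-match size to the same value as the fragment size would be one way to keep the fallback within the same character budget as a regular snippet, though that choice is an assumption here and not something the PR states.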


import static org.elasticsearch.search.rank.feature.RerankSnippetConfig.DEFAULT_NUM_SNIPPETS;

public class TextSimilarityRerankingRankFeaturePhaseRankShardContext extends RerankingRankFeaturePhaseRankShardContext {
Contributor

Do we really need the separation here between TextSimilarityRerankingRankFeaturePhaseRankShardContext and RerankingRankFeaturePhaseRankShardContext? I don't see RerankingRankFeaturePhaseRankShardContext being used elsewhere so maybe we can just have TextSimilarityRerankingRankFeaturePhaseRankShardContext and implement the full logic there?

@@ -337,7 +337,7 @@ protected SearchSourceBuilder finalizeSourceBuilder(SearchSourceBuilder sourceBu
  * @param ctx The query rewrite context
  * @return RetrieverBuilder the rewritten retriever
  */
- protected RetrieverBuilder doRewrite(QueryRewriteContext ctx) {
+ protected RetrieverBuilder doRewrite(QueryRewriteContext ctx) throws IOException {
Contributor

Is it a leftover?

*/
public class SnippetRankInput implements Writeable {

private final RerankSnippetConfig snippets;
Contributor

Is the separation between SnippetRankInput and RerankSnippetConfig necessary? Why not inject numSnippets and snippetQueryBuilder directly here?

/**
* The default token size limit of the Elastic reranker is 512.
*/
private static final int RERANK_TOKEN_SIZE_LIMIT = 512;
Contributor

This is currently used as a character limit and not as a token limit. I don't understand why you're separating the RERANK_TOKEN_SIZE_LIMIT and the DEFAULT_TOKEN_SIZE_LIMIT. Currently the issue is that HighlightBuilder#fragmentSize sets the size of the fragment in terms of number of characters and not tokens.
Since 512 is the context length of the Elastic re-ranker, we can temporarily hack around this by multiplying it by the average length of a token in English. 4096 seems high, since it implies an average of 8 characters per token. The literature on the topic points to an average of 4-5 characters per token, and even fewer if we consider the model's vocabulary and tokenisation (wordpiece, ...).
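The arithmetic behind this comment can be checked with a minimal sketch. It is plain Java for illustration: `FragmentSizeEstimate` and `fragmentSizeChars` are hypothetical names, and the characters-per-token values are only the rough averages mentioned in the comment, not measured figures for any particular tokenizer.

```java
public class FragmentSizeEstimate {
    // Context length of the Elastic reranker, in tokens, as stated in the diff above.
    static final int RERANK_TOKEN_SIZE_LIMIT = 512;

    /**
     * Hypothetical helper: converts a token budget into a character budget,
     * which is the unit the highlighter's fragment size is expressed in.
     */
    static int fragmentSizeChars(int tokenLimit, int avgCharsPerToken) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(tokenLimit, avgCharsPerToken);
    }

    public static void main(String[] args) {
        for (int charsPerToken : new int[] { 4, 5, 8 }) {
            System.out.println(charsPerToken + " chars/token -> "
                + fragmentSizeChars(RERANK_TOKEN_SIZE_LIMIT, charsPerToken) + " chars");
        }
        // prints:
        // 4 chars/token -> 2048 chars
        // 5 chars/token -> 2560 chars
        // 8 chars/token -> 4096 chars
    }
}
```

A multiplier of 8 reproduces the 4096 figure questioned here, while 4 to 5 characters per token gives a budget of roughly 2048 to 2560 characters.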

Labels
>enhancement
:SearchOrg/Relevance (Label for the Search (solution/org) Relevance team)
Team:Search - Relevance (The Search organization Search Relevance team)
Team:SearchOrg (Meta label for the Search Org (Enterprise Search))
v9.2.0
4 participants