Support semantic reranking using contextual snippets instead of entire field text #129369
I did a first pass and left some comments. I think we can better isolate this change in the text similarity rank builder.
server/src/main/java/org/elasticsearch/search/rank/RankBuilder.java (outdated)
server/src/main/java/org/elasticsearch/search/rank/feature/CustomRankInput.java (outdated)
server/src/main/java/org/elasticsearch/search/rank/feature/RankFeatureShardPhase.java (outdated)
.../main/java/org/elasticsearch/search/rank/context/RankFeaturePhaseRankCoordinatorContext.java (outdated)
…rity reranker only
int fragmentSize = tokenSizeLimit * TOKEN_SIZE_LIMIT_MULTIPLIER;
highlightBuilder.fragmentSize(fragmentSize);
SearchHighlightContext searchHighlightContext = highlightBuilder.build(context.getSearchExecutionContext());
context.highlight(searchHighlightContext);
Should we also set noMatchSize to ensure that we always get a snippet for every document?
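To illustrate the concern, here is a minimal, self-contained sketch of the fallback behaviour that a no-match size provides. It is a simplified stand-in, not the Elasticsearch highlighter: the class `SnippetFallback`, the method `pickSnippet`, and the plain substring matching are all hypothetical and exist only for illustration.

```java
import java.util.Optional;

/** Hypothetical illustration only; this is not the Elasticsearch highlighter. */
final class SnippetFallback {

    private SnippetFallback() {}

    /**
     * Returns a snippet of at most fragmentSize characters starting at the first
     * occurrence of term. If the term does not occur, falls back to the first
     * noMatchSize characters of the field, or to nothing when noMatchSize is 0.
     */
    static Optional<String> pickSnippet(String fieldText, String term, int fragmentSize, int noMatchSize) {
        int hit = fieldText.indexOf(term);
        if (hit >= 0) {
            int end = Math.min(fieldText.length(), hit + fragmentSize);
            return Optional.of(fieldText.substring(hit, end));
        }
        if (noMatchSize > 0) {
            // No match: still hand the reranker the beginning of the field.
            return Optional.of(fieldText.substring(0, Math.min(fieldText.length(), noMatchSize)));
        }
        // No match and no fallback: the document would reach the reranker without any text.
        return Optional.empty();
    }

    public static void main(String[] args) {
        String doc = "Elasticsearch reranks the top hits with an inference model.";
        System.out.println(pickSnippet(doc, "reranks", 20, 0));
        System.out.println(pickSnippet(doc, "kibana", 20, 0));
        System.out.println(pickSnippet(doc, "kibana", 20, 13));
    }
}
```

In this sketch, a document with no matching text yields an empty result unless a positive no-match size is given, which is the case the question above is about.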
import static org.elasticsearch.search.rank.feature.RerankSnippetConfig.DEFAULT_NUM_SNIPPETS;

public class TextSimilarityRerankingRankFeaturePhaseRankShardContext extends RerankingRankFeaturePhaseRankShardContext {
Do we really need the separation here between TextSimilarityRerankingRankFeaturePhaseRankShardContext and RerankingRankFeaturePhaseRankShardContext? I don't see RerankingRankFeaturePhaseRankShardContext being used elsewhere, so maybe we can just have TextSimilarityRerankingRankFeaturePhaseRankShardContext and implement the full logic there?
@@ -337,7 +337,7 @@ protected SearchSourceBuilder finalizeSourceBuilder(SearchSourceBuilder sourceBu
      * @param ctx The query rewrite context
      * @return RetrieverBuilder the rewritten retriever
      */
-    protected RetrieverBuilder doRewrite(QueryRewriteContext ctx) {
+    protected RetrieverBuilder doRewrite(QueryRewriteContext ctx) throws IOException {
Is it a leftover?
 */
public class SnippetRankInput implements Writeable {

    private final RerankSnippetConfig snippets;
Is the separation between SnippetRankInput and RerankSnippetConfig necessary? Why not inject numSnippets and snippetQueryBuilder directly here?
/**
 * The default token size limit of the Elastic reranker is 512.
 */
private static final int RERANK_TOKEN_SIZE_LIMIT = 512;
This is currently used as a character limit, not a token limit. I don't understand why you're separating RERANK_TOKEN_SIZE_LIMIT and DEFAULT_TOKEN_SIZE_LIMIT. The issue is that HighlightBuilder#fragmentSize sets the fragment size in characters, not tokens. Since 512 is the context length of the Elastic re-ranker, we can temporarily hack around this by multiplying by the average length of a token in English. 4096 seems high, since it implies an average of 8 characters per token. The literature on the topic points to an average of 4-5 characters per token, or even fewer once we account for the model's vocabulary and tokenisation (WordPiece, ...).
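The arithmetic behind this comment can be sketched as follows. This is a hypothetical helper, not code from the PR: the class `FragmentBudget` and the method `fragmentSizeChars` are made up for illustration, and the characters-per-token values are rough estimates rather than guarantees from any tokenizer.

```java
/** Hypothetical helper, not part of the PR: converts a token budget into a character budget. */
final class FragmentBudget {

    private FragmentBudget() {}

    /**
     * Approximates how many characters fit into tokenLimit tokens.
     * avgCharsPerToken is an assumed average, not a tokenizer guarantee.
     */
    static int fragmentSizeChars(int tokenLimit, int avgCharsPerToken) {
        if (tokenLimit <= 0 || avgCharsPerToken <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // multiplyExact throws ArithmeticException on int overflow instead of wrapping.
        return Math.multiplyExact(tokenLimit, avgCharsPerToken);
    }

    public static void main(String[] args) {
        int tokenLimit = 512; // context length of the Elastic reranker, per the comment above
        for (int charsPerToken : new int[] {4, 5, 8}) {
            System.out.println(charsPerToken + " chars/token -> "
                + fragmentSizeChars(tokenLimit, charsPerToken) + " chars");
        }
    }
}
```

With a 512-token limit, 4 to 5 characters per token gives a fragment size of 2048 to 2560 characters, while the 4096 figure corresponds to 8 characters per token.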
Followup to the POC described in #128255
Adds the ability to rerank based on a smaller number of snippets.
Example, with a default of one snippet:
Example, specifying snippets:
Not specifying snippets will continue to send the entire field contents into the reranker model.