
[RFC] Support Sub Query Raw Scores in Hybrid Search #1419

Open
@owaiskazi19

Description

This issue describes the design for supporting sub query raw scores in hybrid search. This feature has been requested through GitHub issues #1294 and #1180.

Problem

Currently, the hybrid search response only includes the final (normalized) score in each SearchHit, after normalization and combination. However, in several use cases—such as reranking, explainability, or custom post-processing—users require visibility into the original (pre-normalized or raw) scores from each subquery.
The lack of access to individual subquery scores limits users to working only with the final hybrid score, which is insufficient for advanced use cases.

Requirements

Functional Requirements

  • Each SearchHit in the hybrid search response should include the original (pre-normalized) scores of its subqueries.
  • Maintain consistent sub query score ordering across hits.
  • Support both single-shard and multi-shard setups.

Non Functional Requirements

  • Including subquery scores must not significantly impact query response time or introduce regressions in performance.
  • Support Backward Compatibility

Solution Overview

We propose extending the hybrid search response to include a new metadata field: hybridization_sub_query_scores. This field will contain a list of scores corresponding to each subquery executed as part of the hybrid query.

 "hybridization_sub_query_scores": [
                       0.34567,    ---> raw score of sub query 1
                        0.49510515, ---> raw score of sub query 2
                        0.234556    ---> raw score of sub query 3                   
                    ]

Each element in the hybridization_sub_query_scores list corresponds to the score from one of the subqueries. The ordering of scores will follow the internal ordering of subqueries in the hybrid query definition.
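For illustration, a hit in the hybrid search response might then look like the following (index name and document contents are hypothetical; whether the new field lives under fields or as top-level hit metadata is an implementation detail of the options below):

```json
{
  "_index": "my-nlp-index",
  "_id": "1",
  "_score": 0.7,
  "_source": { "text": "..." },
  "fields": {
    "hybridization_sub_query_scores": [0.34567, 0.49510515, 0.234556]
  }
}
```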
The sub query scores can be enabled through a flag sub-query-scores when defining a normalization pipeline for hybrid search, as shown below:

{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "sub-query-scores": true,
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.3,
              0.7
            ]
          }
        }
      }
    }
  ]
}
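For reference, this is how a user would create such a pipeline and apply it at query time (pipeline and index names are illustrative; the search_pipeline request parameter is the existing mechanism for attaching a search pipeline):

```json
PUT /_search/pipeline/hybrid-raw-scores-pipeline
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": { "technique": "min_max" },
        "sub-query-scores": true,
        "combination": { "technique": "arithmetic_mean" }
      }
    }
  ]
}

GET /my-index/_search?search_pipeline=hybrid-raw-scores-pipeline
{
  "query": {
    "hybrid": {
      "queries": [ ... ]
    }
  }
}
```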

The solution can be achieved with different options:

Option 1: Using getFetchSubPhase extension point [Recommended]

This option uses OpenSearch’s existing FetchSubPhase extension mechanism to inject subquery scores into the SearchHit during the fetch phase. A new HybridizationFetchSubPhase class would be implemented to read subquery scores from a shared registry (e.g., HybridScoreRegistry) populated during the query execution phase.

How It Works


  • During the normalization phase, individual subquery scores are collected and stored in a registry (HybridScoreRegistry).
  • During the fetch phase, the HybridizationFetchSubPhase retrieves the per-document subquery scores from the registry and inserts them into each SearchHit under a new field (e.g., _hybridization).
  • This approach avoids altering core scoring or asking the user to add a new processor, and it integrates neatly into the OpenSearch plugin architecture.

Note: In the single-shard case there is a flow where the fetch phase can run before the query phase; in that flow we need to update the SearchHit with the subquery scores at that point.
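A minimal sketch of the proposed HybridScoreRegistry is below. This is hypothetical: the real class would be keyed by SearchContext, modeled here as an opaque Object so the sketch is self-contained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: maps a search context to per-document raw subquery scores.
class HybridScoreRegistry {

    // context -> (shard-local doc id -> one raw score per subquery, in subquery order)
    private static final Map<Object, Map<Integer, float[]>> SCORES_BY_CONTEXT = new ConcurrentHashMap<>();

    static void store(Object searchContext, Map<Integer, float[]> docIdToSubQueryScores) {
        SCORES_BY_CONTEXT.put(searchContext, docIdToSubQueryScores);
    }

    static Map<Integer, float[]> get(Object searchContext) {
        return SCORES_BY_CONTEXT.get(searchContext);
    }

    // Must be called when the search context is released to avoid leaking entries
    static void remove(Object searchContext) {
        SCORES_BY_CONTEXT.remove(searchContext);
    }
}
```

Cleanup on context release is the important design point: without remove(), the registry would grow with every hybrid query.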

Pros

  • Aligns with existing extensibility patterns in OpenSearch.
  • Decouples logic from core query processing — safe, modular, and maintainable.
  • Allows clear separation of concerns between scoring and rendering the response.

Cons

  • Slight memory overhead for storing intermediate scores (should be acceptable for typical query sizes).

Option 2: Using SearchResponse processor

In this approach, subquery scores would be injected into the search response after the fetch phase but before the response is serialized. This could be implemented using a new response processor.

How It Works

  • Create a new SubQueryScoresResponseProcessor in neural search to alter the response.
  • For each SearchHit, it adds the corresponding subquery scores from a shared map or context.
  • This logic runs after all fetch phases have completed.
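The core of such a processor can be sketched as follows. This is a simplified, hypothetical model: hits are represented as plain maps rather than the actual SearchResponse/SearchHit classes, and the injector method stands in for the processor's response-rewriting step.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of SubQueryScoresResponseProcessor's core logic:
// walk the hits, look up each hit's raw subquery scores, and attach them.
class SubQueryScoreInjector {

    static void inject(List<Map<String, Object>> hits, Map<Integer, float[]> scoresByDocId) {
        for (Map<String, Object> hit : hits) {
            Integer docId = (Integer) hit.get("docId");
            float[] scores = scoresByDocId.get(docId);
            if (scores != null) {
                // In the real processor this would be a field on the SearchHit
                hit.put("hybridization_sub_query_scores", scores);
            }
        }
    }
}
```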

Pros

  • Does not require changes to query execution or fetch phases.
  • May simplify logic for cross-phase coordination, as everything happens post-query.

Cons

  • Introduces a new response processor. Less modular and less transparent than using a FetchSubPhase.
  • Increases complexity of response handling logic.
  • Tight coupling to internal response format may create upgrade and compatibility issues.

Option 3: Creating a query parameter for hybrid scores in core

This approach proposes adding a new query-level parameter (?hybridScores=true) to the search request itself. When this flag is set, the query engine internally stores and returns the subquery scores as part of the standard SearchHit. This is very similar to how the verbose pipeline parameter works in Search Pipelines today.

How It Works

  • Modify the SearchSourceBuilder in core to support sub query scores in the source field.
  • Pass the sub query scores from neural search plugin to core.
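Under this option a request might look like the following (the parameter name hybridScores is as proposed; the exact request surface would be decided in core):

```json
GET /my-index/_search?hybridScores=true
{
  "query": {
    "hybrid": {
      "queries": [ ... ]
    }
  }
}
```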

Pros

  • Clean and visible user interface through query parameters.

Cons

  • Tightly couples response formatting to query logic — violates separation of concerns.
  • Increases the complexity and size of the core hybrid query code.
  • Higher risk of introducing performance regressions or bugs.
  • Harder to maintain and test compared to using fetch extensions.

Low Level Design

We need the following changes:

  • The normalize() method would return a map of doc ids and their associated subquery scores.
  • A new class HybridizationFetchSubPhase to inject subquery scores into the SearchHit during the fetch phase.
  • A new class HybridScoreRegistry to store the subquery scores keyed by the associated search context.
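To illustrate the first bullet, here is a simplified, hypothetical normalize() that applies min-max normalization per subquery while also returning a copy of the raw per-document scores for later storage in the registry. The real implementation operates on CompoundTopDocs, not plain maps; NaN is used here to mark subqueries a document did not match.

```java
import java.util.HashMap;
import java.util.Map;

class RawScoreCollectingNormalizer {

    // scores: doc id -> one score per subquery (NaN where the doc did not match).
    // Normalizes each subquery's scores in place with min-max and returns a copy
    // of the raw scores, which the caller can hand to the score registry.
    static Map<Integer, float[]> normalize(Map<Integer, float[]> scores, int numSubQueries) {
        Map<Integer, float[]> rawCopy = new HashMap<>();
        for (Map.Entry<Integer, float[]> e : scores.entrySet()) {
            rawCopy.put(e.getKey(), e.getValue().clone());
        }
        for (int q = 0; q < numSubQueries; q++) {
            // Find per-subquery min and max over all matched docs
            float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
            for (float[] s : scores.values()) {
                if (!Float.isNaN(s[q])) {
                    min = Math.min(min, s[q]);
                    max = Math.max(max, s[q]);
                }
            }
            float range = max - min;
            for (float[] s : scores.values()) {
                if (!Float.isNaN(s[q])) {
                    s[q] = range == 0 ? 1.0f : (s[q] - min) / range;
                }
            }
        }
        return rawCopy;
    }
}
```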


The HybridizationFetchSubPhase would look like the below, adding a _hybridization field with the subquery scores.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.LeafReaderContext;
import org.opensearch.common.document.DocumentField;
import org.opensearch.search.fetch.FetchContext;
import org.opensearch.search.fetch.FetchSubPhase;
import org.opensearch.search.fetch.FetchSubPhaseProcessor;
import org.opensearch.search.internal.SearchContext;

public class HybridizationFetchSubPhase implements FetchSubPhase {

    public HybridizationFetchSubPhase() {}

    @Override
    public FetchSubPhaseProcessor getProcessor(FetchContext fetchContext) throws IOException {
        SearchContext context = ScoreNormalizer.getSearchContext();

        return new FetchSubPhaseProcessor() {
            LeafReaderContext ctx;

            @Override
            public void setNextReader(LeafReaderContext leafReaderContext) throws IOException {
                this.ctx = leafReaderContext;
            }

            @Override
            public void process(HitContext hitContext) {
                // Scores were stored per search context during the normalization phase
                Map<Integer, float[]> scoreMap = HybridScoreRegistry.get(context);
                if (scoreMap == null) {
                    return;
                }
                // Assumes the registry is keyed by the same doc id that hitContext exposes
                int docId = hitContext.docId();
                float[] subqueryScores = scoreMap.get(docId);

                if (subqueryScores != null) {
                    // Box the raw scores so the field serializes as a plain list of numbers
                    List<Object> values = new ArrayList<>(subqueryScores.length);
                    for (float score : subqueryScores) {
                        values.add(score);
                    }
                    hitContext.hit().setDocumentField("_hybridization", new DocumentField("_hybridization", values));
                }
            }
        };
    }
}

Benchmarks

The benchmarks ran on an OpenSearch cluster consisting of a single r6g.8xlarge instance as the coordinator node and three r6g.8xlarge instances as data nodes, with multiple shards.

Min max normalization

| dataset | 3.1.0 p50 | Sub Query Scores p50 | p50 diff | 3.1.0 p90 | Sub Query Scores p90 | p90 diff | 3.1.0 p99 | Sub Query Scores p99 | p99 diff |
|---|---|---|---|---|---|---|---|---|---|
| scidocs | 66.5 | 66.5 | 0 | 70.5 | 70.5 | 0 | 76.005 | 75.005 | -1.31% |
| fiqa | 70 | 68 | -2.86% | 74 | 71.65 | -3.18% | 77.5 | 75.5 | -2.58% |
| quora | 70 | 70 | 0 | 75 | 74 | -1.33% | 83 | 82 | -1.20% |
| arguana | 118 | 117 | -0.85% | 125.5 | 124 | -1.20% | 134.5 | 132 | -1.80% |

Sub Query Scores yields latency on par with 3.1.0, with modest p90 and p99 improvements, especially for fiqa and arguana, and no regressions.

RRF normalization

| dataset | 3.1.0 p50 | Sub Query Scores p50 | p50 diff | 3.1.0 p90 | Sub Query Scores p90 | p90 diff | 3.1.0 p99 | Sub Query Scores p99 | p99 diff |
|---|---|---|---|---|---|---|---|---|---|
| scidocs | 67.5 | 66 | -2.22% | 71 | 69.5 | -2.11% | 75.505 | 74.5 | -1.33% |
| fiqa | 69.5 | 67 | -3.60% | 74 | 71.5 | -3.38% | 78.764 | 74.5 | 0.67% |
| quora | 72 | 70 | -2.78% | 77 | 74 | -3.90% | 84 | 81 | -3.57% |
| arguana | 117 | 117 | 0 | 124 | 124 | 0 | 132 | 131.475 | -0.40% |

RRF normalization combined with Sub Query Scores shows consistent, deeper improvements, particularly for fiqa and quora, improving tail latencies (p99).

We will perform another round of benchmarks.

Feedback Required

We greatly value feedback from the community to ensure that this proposal addresses real-world use cases effectively.
