
[RFC] Add upper bound parameter for min_max normalization technique #1440

Open
@ryanbogan

Description


Introduction

This document discusses the design of the upper bound feature for min-max score normalization technique in OpenSearch's hybrid search capability, complementing the existing lower bound feature.

Problem Statement

The current min-max normalization can produce misleading relevancy scores when the theoretical maximum score is known but differs from the actual maximum score in the result set. In neural/k-NN search scenarios where scores have known theoretical bounds (e.g., [0.75, 1.0]), the current normalization can overstate document relevance by normalizing to the actual maximum score rather than the theoretical maximum. Users who need more precise control over score normalization can use the upper bound feature to improve the relevance of their results.

Requirements

Functional Requirements

  1. Support configurable upper bounds at the sub-query level
  2. Provide a way to define an upper bound score, which can be ignored if needed
  3. Allow independent upper bound configuration for each sub-query
  4. Ensure proper interaction with the lower bound feature while maintaining its existing behavior

Non-Functional Requirements

  1. Minimal performance impact on score normalization

Current State

The min-max normalization technique currently:

  • Uses actual retrieved scores to find minimum and maximum scores for normalization
  • Has a lower bound feature implemented through LowerBound class with an inner Mode enum (APPLY, CLIP, IGNORE)
  • Contains bound-related logic directly within the normalization class

Current Score Calculation Formula

normalized_score = (score - min_score) / (max_score - min_score)

Note: the effective min_score depends on the LowerBound.Mode in use

Example


Consider a scenario where scores theoretically range from 0.0 to 1.0. When a query returns scores [0.75, 0.76, 0.77], the current normalization process treats:

  • 0.75 as the minimum, normalizing it to 0.0
  • 0.77 as the maximum, normalizing it to 1.0
  • 0.76 as the midpoint, normalizing it to 0.5

While the existing lower bound feature can address score distortion at the lower end by setting a minimum threshold, there is no equivalent mechanism for the upper end. This creates a significant distortion in relevancy representation: although all scores are clustered between 0.75 and 0.77, normalization spreads them across the entire range from 0.0 to 1.0, suggesting much larger relevancy differences than actually exist. The current implementation cannot contextualize these scores within their theoretical range, where they all represent highly relevant documents close to the maximum possible value of 1.0.
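The distortion is easy to reproduce with a small sketch (plain Python, illustrative only; the actual implementation lives in the Java MinMaxScoreNormalizationTechnique class):

```python
def min_max_normalize(scores, min_score=None, max_score=None):
    """Min-max normalize, optionally against supplied bounds
    instead of the observed minimum/maximum."""
    lo = min(scores) if min_score is None else min_score
    hi = max(scores) if max_score is None else max_score
    return [(s - lo) / (hi - lo) for s in scores]

scores = [0.75, 0.76, 0.77]

# Current behavior: observed min/max stretch the cluster across [0, 1].
print(min_max_normalize(scores))  # [0.0, 0.5, 1.0]

# Against the theoretical range [0.0, 1.0], closeness to the maximum is preserved.
print(min_max_normalize(scores, min_score=0.0, max_score=1.0))  # [0.75, 0.76, 0.77]
```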

Solution HLD

Proposed Solution


The proposed solution introduces an upper bound feature to complement the existing lower bound functionality in the min-max score normalization technique. This will be achieved through the following architectural changes:

  1. Abstract Base Class: Create a new ScoreBound abstract class to encapsulate common behavior for both upper and lower bounds.
  2. Bound Mode Enum: Extract the existing LowerBound.Mode into a standalone BoundMode enum to be used by both bound types.
  3. Upper Bound Implementation: Introduce a new UpperBound class extending ScoreBound to handle upper bound logic.
  4. Refactor Existing Lower Bound: Modify the LowerBound class to extend ScoreBound and use the new BoundMode enum.
  5. Enhanced Normalization Technique: Update MinMaxScoreNormalizationTechnique to support both upper and lower bounds using a common interface.
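A rough sketch of the proposed hierarchy (shown in Python for brevity; the actual implementation would be Java classes in the plugin, and the method/field names here are illustrative, not the real API):

```python
from abc import ABC
from dataclasses import dataclass
from enum import Enum


class BoundMode(Enum):
    """Standalone mode enum shared by both bound types
    (extracted from the existing LowerBound.Mode)."""
    APPLY = "apply"
    CLIP = "clip"
    IGNORE = "ignore"


@dataclass
class ScoreBound(ABC):
    """Common behavior for upper and lower bounds."""
    mode: BoundMode = BoundMode.IGNORE
    bound_score: float = 0.0


class LowerBound(ScoreBound):
    """Refactored to extend ScoreBound and use BoundMode; behavior unchanged."""


class UpperBound(ScoreBound):
    """New class mirroring LowerBound for the upper end of the score range."""
```

The symmetry means MinMaxScoreNormalizationTechnique can treat both bounds through the shared ScoreBound interface rather than duplicating bound-handling logic.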

API Configuration

{
  "normalization": {
    "technique": "min_max",
    "parameters": {
      "lower_bounds": [
        { 
          "mode": "apply",
          "min_score": 0.0
        },
        { 
          "mode": "clip",
          "min_score": 0.0
        },
        {
          "mode": "ignore"
        }
      ],
      "upper_bounds": [
        {
          "mode": "apply",
          "max_score": 1.0
        },
        {
          "mode": "clip",
          "max_score": 1.0
        },
        {
          "mode": "ignore"
        }
      ]
    }
  }
}
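Each entry in lower_bounds/upper_bounds corresponds positionally to one sub-query. A minimal parsing sketch (Python, illustrative only; field names follow the JSON above, and the fallback defaults shown are assumptions, not confirmed behavior):

```python
def parse_upper_bounds(params):
    """Parse the upper_bounds list from the technique parameters.

    Assumed defaults (illustrative): mode falls back to "ignore" and
    max_score to 1.0 when omitted.
    """
    bounds = []
    for entry in params.get("upper_bounds", []):
        mode = entry.get("mode", "ignore")
        if mode not in ("apply", "clip", "ignore"):
            raise ValueError(f"unknown bound mode: {mode}")
        bounds.append((mode, entry.get("max_score", 1.0)))
    return bounds

params = {
    "upper_bounds": [
        {"mode": "apply", "max_score": 1.0},
        {"mode": "clip", "max_score": 1.0},
        {"mode": "ignore"},
    ]
}
print(parse_upper_bounds(params))
# [('apply', 1.0), ('clip', 1.0), ('ignore', 1.0)]
```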

Key Design Decisions

Standalone Bound Mode Enum

  • Decision: Extract Mode from LowerBound into a separate BoundMode enum
  • Rationale: Allows shared use between upper and lower bounds, improving consistency and maintainability

Symmetrical Upper Bound Implementation

  • Decision: Implement UpperBound similarly to LowerBound
  • Rationale: Provides a consistent API and behavior for users, simplifying understanding and usage

Minimal Changes to Existing API

  • Decision: Extend the current configuration structure by adding upper_bounds alongside lower_bounds, without modifying the existing lower_bounds structure or behavior
  • Rationale: Addresses the functional requirement to maintain current functionality for lower bounds. Ensures proper interaction between upper and lower bounds while preserving existing lower bound behavior, allowing users to adopt the new feature without impacting their current queries

Bound Processing in Normalization Technique

  • Decision: Process both bounds within the normalizeSingleScore method
  • Rationale: Centralizes bound logic, ensuring correct interaction between upper and lower bounds

Solution LLD


New Score Calculation Formula

normalized_score = (score - effective_min_score) / (effective_max_score - effective_min_score)
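How the effective bounds are derived depends on each bound's mode. A sketch of one plausible semantics (Python, illustrative; the exact mode behavior here — "apply" substituting the bound into the formula, "clip" additionally clamping the raw score — is an assumption modeled on the existing lower bound feature):

```python
def normalize_single_score(score, min_score, max_score,
                           lower=("ignore", 0.0), upper=("ignore", 1.0)):
    """Normalize one score, taking both bound modes into account.

    lower/upper are (mode, bound_score) pairs; min_score/max_score are
    the observed minimum and maximum of the sub-query's result set.
    """
    lo_mode, lo_bound = lower
    hi_mode, hi_bound = upper

    # "clip" clamps the raw score into the bounded range first.
    if lo_mode == "clip":
        score = max(score, lo_bound)
    if hi_mode == "clip":
        score = min(score, hi_bound)

    # "apply" and "clip" substitute the bound as the effective min/max;
    # "ignore" keeps the observed value.
    effective_min = lo_bound if lo_mode in ("apply", "clip") else min_score
    effective_max = hi_bound if hi_mode in ("apply", "clip") else max_score

    return (score - effective_min) / (effective_max - effective_min)

# With theoretical bounds [0.0, 1.0] applied, clustered scores stay clustered.
scores = [0.75, 0.76, 0.77]
normalized = [normalize_single_score(s, min(scores), max(scores),
                                     lower=("apply", 0.0), upper=("apply", 1.0))
              for s in scores]
print(normalized)  # [0.75, 0.76, 0.77]
```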

Preliminary Benchmarking

Initial benchmarking shows improvements in relevance metrics when using bounds in some scenarios. Here are two examples:

Example 1: Upper Bounds (nfcorpus dataset)

| Metric   | Default | With Upper Bound | Improvement |
|----------|---------|------------------|-------------|
| NDCG@5   | 0.3343  | 0.3379           | +1.10%      |
| NDCG@10  | 0.3030  | 0.3017           | -0.40%      |
| NDCG@100 | 0.2671  | 0.2691           | +0.70%      |

Example 2: Combined Lower/Upper Bounds (TREC-COVID dataset)

| Metric   | Default | With Bounds | Improvement |
|----------|---------|-------------|-------------|
| NDCG@5   | 0.6025  | 0.6707      | +11.30%     |
| NDCG@10  | 0.5518  | 0.6218      | +12.70%     |
| NDCG@100 | 0.3859  | 0.4318      | +11.90%     |

Note: These results are from specific test configurations. Results may vary depending on the nature of queries, index settings, and characteristics of the dataset.

Testing

Unit Tests:

  • Upper bound configuration parsing
  • Score normalization with different modes
  • Integration with lower bounds
  • Edge cases and error conditions

Integration Tests:

  • All three upper bound modes
  • Integration with lower bounds

Community Feedback

We appreciate all feedback from the community on this RFC. In addition, we are particularly interested in your thoughts on the following questions:

  1. Would you prefer additional configuration options beyond what's proposed?
  2. How should the system behave when both upper and lower bounds are specified in potentially conflicting ways?
  3. How would you combine this with other scoring techniques in your current implementations?
  4. What types of examples would help you understand when and how to use upper bounds effectively?

Please share your feedback through comments on this RFC, GitHub issues, or pull requests with proposed changes.
