Skip to content

Add option to normalize vector distances on query #298

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 31, 2025

Conversation

rbs333
Copy link
Collaborator

@rbs333 rbs333 commented Mar 24, 2025

This pr accomplishes 2 goals:

  1. Add an option for users to easily get back a similarity value between 0 and 1 that they might expect to compare against other vector dbs.
  2. Fix the current bug that distance_threshold is validated to be between 0 and 1 when in reality it can take values between 0 and 2.

Note: after much careful thought I believe it is best that for 0.5.0 we do not start enforcing all distance_thresholds between 0 and 1 and move to this option as default behavior. Ideally this metric would be consistent throughout our code and I don't love supporting this flag but I think it provides the value that is scoped for this ticket while inflicting the least amount of pain and confusion.

Changes:

  1. Adds the normalize_vector_distance flag to VectorQuery and VectorRangeQuery.

Behavior:

  • If set to True it normalizes values returned from redis to a value between 0 and 1.
  • For cosine similarity, it applies (2 - value)/2.
  • For L2 distance, it applies normalization (1/(1+value)).
  • For IP, it does nothing and throws a warning since normalized IP is cosine by definition.
  • For VectorRangeQuery, if normalize_vector_distance=True the distance threshold is now validated to be between 0 and 1 and denormalized for execution against the database to make consistent.
  1. Relaxes validation for semantic caching and routing to be between 0 and 2 fixing the current bug and aligning with how the database actually functions.

@rbs333 rbs333 requested a review from tylerhutcherson March 24, 2025 14:53
@@ -191,3 +191,10 @@ def wrapper():
return

return wrapper


def norm_cosine_distance(value: float) -> float:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add additional check logic to this function kept it simple stupid to start

@tylerhutcherson
Copy link
Collaborator

Nice @rbs333 good start. We do want to enable normalization regardless of distance metric.

  1. So imagine a map between chosen/indexed distance metric and the normalization function by distance metric. So in that vein, we can rename the param to a generic normalize naming!
  2. Redis maps cosine distance to 0-2 both in input and output. and I think our goal is to normalize it to 0-1 instead (which is how most users think of and use our implementation anyways because 0-2 is not well documented on redis docs.
  3. there are some places where we currently (wrongly) enforce the distance threshold between 0-1, even though it's technically 0-2. (see semantic cache base class). this will be "correct" once you make this change as we can internally control that normalization param on the semantic cache query definitions (internals)

@rbs333
Copy link
Collaborator Author

rbs333 commented Mar 24, 2025

Nice @rbs333 good start. We do want to enable normalization regardless of distance metric.

  1. So imagine a map between chosen/indexed distance metric and the normalization function by distance metric. So in that vein, we can rename the param to a generic normalize naming!
  2. Redis maps cosine distance to 0-2 both in input and output. and I think our goal is to normalize it to 0-1 instead (which is how most users think of and use our implementation anyways because 0-2 is not well documented on redis docs.
  3. there are some places where we currently (wrongly) enforce the distance threshold between 0-1, even though it's technically 0-2. (see semantic cache base class). this will be "correct" once you make this change as we can internally control that normalization param on the semantic cache query definitions (internals)
  1. Heard and can do.

Is inner product also bounded between 0-2 returned from redis? It is not documented well and if that is the case this formula is expressed wrong in the docs.

For euclidean, what normalization were you thinking? It's not obvious to me.

image
  1. As implemented currently the output of the normalization is between 0 and 1.
  2. Heard and will update.

@tylerhutcherson
Copy link
Collaborator

Nice @rbs333 good start. We do want to enable normalization regardless of distance metric.

  1. So imagine a map between chosen/indexed distance metric and the normalization function by distance metric. So in that vein, we can rename the param to a generic normalize naming!
  2. Redis maps cosine distance to 0-2 both in input and output. and I think our goal is to normalize it to 0-1 instead (which is how most users think of and use our implementation anyways because 0-2 is not well documented on redis docs.
  3. there are some places where we currently (wrongly) enforce the distance threshold between 0-1, even though it's technically 0-2. (see semantic cache base class). this will be "correct" once you make this change as we can internally control that normalization param on the semantic cache query definitions (internals)
  1. Heard and can do.

Is inner product also bounded between 0-2 returned from redis? It is not documented well and if that is the case this formula is expressed wrong in the docs.

For euclidean, what normalization were you thinking? It's not obvious to me.

image 2. As implemented currently the output of the normalization is between 0 and 1. 3. Heard and will update.

For euclidean we should be able to do 1/(1-raw_dist) since the raw distance value is unbounded on the upper limit.

@rbs333 rbs333 added the enhancement New feature or request label Mar 25, 2025
@tylerhutcherson tylerhutcherson changed the title add normalize cosine distance flag Add option to normalize vector distances on query Mar 26, 2025
Copy link
Collaborator

@tylerhutcherson tylerhutcherson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok looking solid, left a few comments. But unfortunately, we might need one more discussion on this as I am reading this code and thinking about 2 things:

  • Given L2 is unbounded by definition, is there a scenario here where having this value normalized makes any sense in a use case? We might want to actually take a peek at 2-3 other vector db solutions to see if they document how they normalize these kinds of metrics.
  • Secondly, and more importantly, if a user sets distance_threshold as a value between [x, y], wouldn't they expect the vector distance they get on the output to match this? Tricky

@rbs333
Copy link
Collaborator Author

rbs333 commented Mar 26, 2025

Ok looking solid, left a few comments. But unfortunately, we might need one more discussion on this as I am reading this code and thinking about 2 things:

  • Given L2 is unbounded by definition, is there a scenario here where having this value normalized makes any sense in a use case? We might want to actually take a peek at 2-3 other vector db solutions to see if they document how they normalize these kinds of metrics.
  • Secondly, and more importantly, if a user sets distance_threshold as a value between [x, y], wouldn't they expect the vector distance they get on the output to match this? Tricky

@tylerhutcherson personally I think we should limit scope of this ticket to just normalize for cosine distance. The customer needs we are addressing, that I have actually witnessed people get confused by, is having an option for a vector similarity type output for cosine. I have not seen a single demo from a competitor or use case from a customer where they used euclidean distance and especially not normalized euclidean distance. I think we risk spinning our wheels on something nobody really cares about.

On the second point, yeah this was something I debated as well. Ideally I think we get to a point that for RangeQueries, at very least, normalization is the default option such that a distance_threshold between 0 and 1 always makes sense. The extra troubling thing is that this would work except for the cases involving the semantic_router right now we use aggregations to handle that with distance_threshold where we can't normalize ahead of time which makes things complicated.

@rbs333 rbs333 force-pushed the feat/RAAE-599/distance-normalization branch 2 times, most recently from 9c88680 to 077cd8a Compare March 27, 2025 14:42
Copy link
Collaborator

@tylerhutcherson tylerhutcherson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NICE

@@ -493,6 +519,14 @@ def set_distance_threshold(self, distance_threshold: float):
raise TypeError("distance_threshold must be of type float or int")
if distance_threshold < 0:
raise ValueError("distance_threshold must be non-negative")
if self._normalize_vector_distance:
if distance_threshold > 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@abrookins abrookins force-pushed the feat/RAAE-599/distance-normalization branch from 077cd8a to d2d1b5c Compare March 31, 2025 22:51
@abrookins abrookins merged commit 7ffe89e into 0.5.0 Mar 31, 2025
31 checks passed
justin-cechmanek pushed a commit that referenced this pull request Apr 2, 2025
This pr accomplishes 2 goals:
1. Add an option for users to easily get back a similarity value between
0 and 1 that they might expect to compare against other vector dbs.
2. Fix the current bug that `distance_threshold` is validated to be
between 0 and 1 when in reality it can take values between 0 and 2.

> Note: after much careful thought I believe it is best that for `0.5.0`
we do **not** start enforcing all distance_thresholds between 0 and 1
and move to this option as default behavior. Ideally this metric would
be consistent throughout our code and I don't love supporting this flag
but I think it provides the value that is scoped for this ticket while
inflicting the least amount of pain and confusion.

Changes:

1. Adds the `normalize_vector_distance` flag to VectorQuery and
VectorRangeQuery.

Behavior:
- If set to `True` it normalizes values returned from redis to a value
between 0 and 1.
- For cosine similarity, it applies `(2 - value)/2`.
- For L2 distance, it applies normalization `(1/(1+value))`.
- For IP, it does nothing and throws a warning since normalized IP is
cosine by definition.
- For VectorRangeQuery, if `normalize_vector_distance=True` the distance
threshold is now validated to be between 0 and 1 and denormalized for
execution against the database to make consistent.

2. Relaxes validation for semantic caching and routing to be between 0
and 2 fixing the current bug and aligning with how the database actually
functions.
abrookins pushed a commit that referenced this pull request Apr 4, 2025
This pr accomplishes 2 goals:
1. Add an option for users to easily get back a similarity value between
0 and 1 that they might expect to compare against other vector dbs.
2. Fix the current bug that `distance_threshold` is validated to be
between 0 and 1 when in reality it can take values between 0 and 2.

> Note: after much careful thought I believe it is best that for `0.5.0`
we do **not** start enforcing all distance_thresholds between 0 and 1
and move to this option as default behavior. Ideally this metric would
be consistent throughout our code and I don't love supporting this flag
but I think it provides the value that is scoped for this ticket while
inflicting the least amount of pain and confusion.

Changes:

1. Adds the `normalize_vector_distance` flag to VectorQuery and
VectorRangeQuery.

Behavior:
- If set to `True` it normalizes values returned from redis to a value
between 0 and 1.
- For cosine similarity, it applies `(2 - value)/2`.
- For L2 distance, it applies normalization `(1/(1+value))`.
- For IP, it does nothing and throws a warning since normalized IP is
cosine by definition.
- For VectorRangeQuery, if `normalize_vector_distance=True` the distance
threshold is now validated to be between 0 and 1 and denormalized for
execution against the database to make consistent.

2. Relaxes validation for semantic caching and routing to be between 0
and 2 fixing the current bug and aligning with how the database actually
functions.
abrookins pushed a commit that referenced this pull request Apr 4, 2025
This pr accomplishes 2 goals:
1. Add an option for users to easily get back a similarity value between
0 and 1 that they might expect to compare against other vector dbs.
2. Fix the current bug that `distance_threshold` is validated to be
between 0 and 1 when in reality it can take values between 0 and 2.

> Note: after much careful thought I believe it is best that for `0.5.0`
we do **not** start enforcing all distance_thresholds between 0 and 1
and move to this option as default behavior. Ideally this metric would
be consistent throughout our code and I don't love supporting this flag
but I think it provides the value that is scoped for this ticket while
inflicting the least amount of pain and confusion.

Changes:

1. Adds the `normalize_vector_distance` flag to VectorQuery and
VectorRangeQuery.

Behavior:
- If set to `True` it normalizes values returned from redis to a value
between 0 and 1.
- For cosine similarity, it applies `(2 - value)/2`.
- For L2 distance, it applies normalization `(1/(1+value))`.
- For IP, it does nothing and throws a warning since normalized IP is
cosine by definition.
- For VectorRangeQuery, if `normalize_vector_distance=True` the distance
threshold is now validated to be between 0 and 1 and denormalized for
execution against the database to make consistent.

2. Relaxes validation for semantic caching and routing to be between 0
and 2 fixing the current bug and aligning with how the database actually
functions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants