Add option to normalize vector distances on query #298
Conversation
@@ -191,3 +191,10 @@ def wrapper():
            return

        return wrapper


    def norm_cosine_distance(value: float) -> float:
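As a rough sketch of what the helper under discussion could look like, assuming (per the PR description) that Redis returns cosine distance in [0, 2] and the mapping is `(2 - value)/2`; the bounds check here is an illustrative addition, not part of the diff:

```python
def norm_cosine_distance(value: float) -> float:
    """Map a raw cosine distance in [0, 2] onto [0, 1], where 1 means identical vectors.

    The bounds check is illustrative only; the actual diff keeps the function minimal.
    """
    if not 0 <= value <= 2:
        raise ValueError("cosine distance from Redis is expected to be in [0, 2]")
    return (2 - value) / 2
```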
We could add additional check logic to this function; I kept it simple stupid (KISS) to start.
Nice @rbs333 good start. We do want to enable normalization regardless of distance metric.
Is the inner product value returned from Redis also bounded between 0 and 2? It isn't well documented, and if that is the case, this formula is expressed incorrectly in the docs. For euclidean, what normalization were you thinking? It's not obvious to me.
For euclidean we should be able to do `1/(1 + value)`.
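A minimal sketch of the euclidean counterpart, assuming the `1/(1 + value)` mapping described later in the PR description (the function name is hypothetical):

```python
def norm_l2_distance(value: float) -> float:
    """Squash an unbounded L2 distance into (0, 1]: distance 0 maps to 1, large distances approach 0."""
    return 1 / (1 + value)
```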
Ok, looking solid; left a few comments. But unfortunately, we might need one more discussion on this, as I am reading this code and thinking about two things:

- Given L2 is unbounded by definition, is there a scenario where having this value normalized makes sense in a use case? We might want to take a peek at 2-3 other vector DB solutions to see whether they document how they normalize these kinds of metrics.
- Secondly, and more importantly, if a user sets `distance_threshold` as a value between [x, y], wouldn't they expect the vector distance they get on the output to match this? Tricky.
@tylerhutcherson personally I think we should limit the scope of this ticket to normalizing cosine distance only. The customer need we are addressing, and one I have actually watched people get confused by, is having a vector-similarity-style output option for cosine. I have not seen a single competitor demo or customer use case that used euclidean distance, let alone normalized euclidean distance, so I think we risk spinning our wheels on something nobody really cares about. On the second point, yeah, this was something I debated as well. Ideally I think we get to a point where, for RangeQueries at the very least, normalization is the default option such that a …
Force-pushed from 9c88680 to 077cd8a.
NICE
@@ -493,6 +519,14 @@ def set_distance_threshold(self, distance_threshold: float):
            raise TypeError("distance_threshold must be of type float or int")
        if distance_threshold < 0:
            raise ValueError("distance_threshold must be non-negative")
        if self._normalize_vector_distance:
            if distance_threshold > 1:
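One plausible reading of how the normalized threshold could be handled (a sketch, not the PR's exact code): validate the user-supplied threshold against [0, 1] and, before executing against Redis, map it back through the mathematical inverse of the cosine normalization. The helper name, the toy class, and the attribute names are assumptions.

```python
def denorm_cosine_distance(value: float) -> float:
    """Inverse of (2 - d) / 2: map a normalized score in [0, 1] back to a raw cosine distance in [0, 2]."""
    return max(2 - 2 * value, 0.0)


class RangeQuerySketch:
    """Toy stand-in for a range query object, only to illustrate the threshold handling."""

    def __init__(self, normalize_vector_distance: bool = False):
        self._normalize_vector_distance = normalize_vector_distance
        self._distance_threshold = 0.2

    def set_distance_threshold(self, distance_threshold: float):
        if not isinstance(distance_threshold, (int, float)):
            raise TypeError("distance_threshold must be of type float or int")
        if distance_threshold < 0:
            raise ValueError("distance_threshold must be non-negative")
        if self._normalize_vector_distance:
            if distance_threshold > 1:
                raise ValueError(
                    "when normalize_vector_distance=True, distance_threshold must be between 0 and 1"
                )
            # Map the normalized threshold back to the raw distance Redis filters on.
            distance_threshold = denorm_cosine_distance(distance_threshold)
        self._distance_threshold = distance_threshold
```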
nice!
Force-pushed from 077cd8a to d2d1b5c.
This PR accomplishes 2 goals:

1. Add an option for users to easily get back a similarity value between 0 and 1 that they might expect when comparing against other vector DBs.
2. Fix the current bug that `distance_threshold` is validated to be between 0 and 1 when in reality it can take values between 0 and 2.

> Note: after much careful thought I believe it is best that for `0.5.0` we do **not** start enforcing all distance thresholds between 0 and 1 and move to this option as the default behavior. Ideally this metric would be consistent throughout our code, and I don't love supporting this flag, but I think it provides the value scoped for this ticket while inflicting the least amount of pain and confusion.

Changes:

1. Adds the `normalize_vector_distance` flag to VectorQuery and VectorRangeQuery. Behavior:
   - If set to `True`, it normalizes values returned from Redis to a value between 0 and 1.
   - For cosine similarity, it applies `(2 - value)/2`.
   - For L2 distance, it applies `1/(1 + value)`.
   - For IP (inner product), it does nothing and throws a warning, since normalized IP is cosine by definition.
   - For VectorRangeQuery, if `normalize_vector_distance=True`, the distance threshold is now validated to be between 0 and 1 and denormalized for execution against the database so the results stay consistent.
2. Relaxes validation for semantic caching and routing to accept values between 0 and 2, fixing the current bug and aligning with how the database actually functions.
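For readers skimming the thread, a hypothetical usage sketch of the flag described above; the field names, embedding values, and surrounding index setup are placeholders, and only `normalize_vector_distance` comes from this PR:

```python
from redisvl.query import VectorQuery

# Hypothetical 3-dimensional embedding; real usage would come from an embedding model.
query = VectorQuery(
    vector=[0.1, 0.2, 0.3],
    vector_field_name="embedding",
    return_fields=["title"],
    num_results=5,
    normalize_vector_distance=True,  # vector_distance in the results is mapped into [0, 1]
)

# results = index.query(query)  # assuming an existing SearchIndex named `index`
```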