-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[ENH]: (Rust client): add EF config #5809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
7dd143f to
481b589
Compare
| } | ||
|
|
||
| impl ChromaCollection { | ||
| pub(crate) fn new(client: ChromaHttpClient, collection: Collection) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change ended up not being strictly necessary for this PR but makes things slightly cleaner so I kept it
481b589 to
1842640
Compare
1842640 to
e705502
Compare
|
Embed-Function Config GATs, Dense/Sparse Traits & Collection Helpers Introduces explicit Key Changes• Split generic Affected Areas• rust/chroma/src/embed/* (traits + implementations) This summary was automatically generated by @propel-code-bot |
e705502 to
126ac80
Compare
rust/chroma/src/embed/bm25.rs
Outdated
| token_max_length: usize, | ||
| } | ||
|
|
||
| impl TryInto<EmbeddingFunctionConfiguration> for BM25Config { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought TryFrom is always preferred over TryInto. Also why is it possible to return error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will replace with TryFrom
error is possible because of serialization
| fn get_config(&self) -> Result<Self::Config, Self::Error> { | ||
| Ok(OllamaEmbeddingFunctionConfig { | ||
| url: self.host.clone(), | ||
| model_name: self.model.clone(), | ||
| timeout: 60, | ||
| }) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[CriticalError]
The timeout is hardcoded here, which breaks the configuration round-trip property. If an OllamaEmbeddingFunction is created with a non-default timeout via build_from_config, this function will return an incorrect timeout value.
To fix this, the timeout should be stored on the OllamaEmbeddingFunction struct. This requires updating the struct definition, new(), and build_from_config() to handle the timeout field consistently.
For example, this function should be updated to use the stored value:
fn get_config(&self) -> Result<Self::Config, Self::Error> {
Ok(OllamaEmbeddingFunctionConfig {
url: self.host.clone(),
model_name: self.model.clone(),
timeout: self.timeout, // Use stored value
})
}Context for Agents
[**CriticalError**]
The timeout is hardcoded here, which breaks the configuration round-trip property. If an `OllamaEmbeddingFunction` is created with a non-default timeout via `build_from_config`, this function will return an incorrect timeout value.
To fix this, the `timeout` should be stored on the `OllamaEmbeddingFunction` struct. This requires updating the struct definition, `new()`, and `build_from_config()` to handle the `timeout` field consistently.
For example, this function should be updated to use the stored value:
```rust
fn get_config(&self) -> Result<Self::Config, Self::Error> {
Ok(OllamaEmbeddingFunctionConfig {
url: self.host.clone(),
model_name: self.model.clone(),
timeout: self.timeout, // Use stored value
})
}
```
File: rust/chroma/src/embed/ollama.rs
Line: 182| H: TokenHasher + Send + Sync + 'static, | ||
| { | ||
| type Embedding = SparseVector; | ||
| impl SparseEmbeddingFunction for BM25SparseEmbeddingFunction<Bm25Tokenizer, Murmur3AbsHasher> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the generics pinned here?
| Ok(Self { | ||
| tokenizer, | ||
| hasher: Murmur3AbsHasher::default(), | ||
| k: config.k, | ||
| b: config.b, | ||
| avg_len: config.avg_doc_length, | ||
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This forces people to use the builtin hasher and tokenizer. Personally I do not like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we supported custom hashers and tokenizers how would we build it in the other languages? i think having the default one is fine no? if they want to have a custom one they can write their own ef
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the hasher and tokenizers we provide in Rust, we need to copy the same logic in other language. We could have a default ef, but not in this way because people cannot impl EmbeddingFunction for BM25SparseEmbeddingFunction<?, ?> themselves without wrapping in around.
rust/chroma/src/embed/ollama.rs
Outdated
| timeout: u64, | ||
| } | ||
|
|
||
| impl TryInto<EmbeddingFunctionConfiguration> for OllamaEmbeddingFunctionConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^
| } | ||
|
|
||
| impl AsRef<str> for Key { | ||
| fn as_ref(&self) -> &str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we have multiple identical impl for Key -> &str, could be better if we unify them
| where | ||
| Self: Sized; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i know this isn't necessary at this point for what this PR accomplishes, but can we add a todo for the other trait functions (default space, supported spaces)?
| /// # Ok(()) | ||
| /// # } | ||
| /// ``` | ||
| async fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Vec<f32>>, Self::Error>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another TODO for the embed_query_strs would be helpful for tracking
| /// # } | ||
| /// ``` | ||
| async fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Self::Embedding>, Self::Error>; | ||
| async fn embed_strs(&self, batches: &[&str]) -> Result<Vec<SparseVector>, Self::Error>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo for embed_query_strs here as well
| impl AsRef<str> for Key { | ||
| fn as_ref(&self) -> &str { | ||
| match self { | ||
| Key::Document => "#document", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
side note, can we make it so schema supports using Key::Document in source key?
126ac80 to
e14aaec
Compare
e14aaec to
a3e645f
Compare
rescrv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll defer to Sicheng on sparse vector feel. I think this makes sense.
Description of changes
Main changes:
TryInto<EmbeddingFunctionConfiguration>.This does not introduce any machinery to automatically persist/hydrate EFs--currently, users must call
get_config()/build_from_config()themselves (or.try_into()on the config). We can add that as a follow-up later but it's pretty non-trivial to build:query()will have to change which is a significant breaking changeTest plan
How are these changes tested?
pytestfor python,yarn testfor js,cargo testfor rustMigration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?