@codetheweb
Contributor

@codetheweb codetheweb commented Nov 4, 2025

Description of changes

Main changes:

  • There are now separate sparse and dense EF traits.
  • EF traits have a config GAT which must implement TryInto<EmbeddingFunctionConfiguration>.
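
For readers skimming the thread, the trait shape described above might look roughly like the sketch below. This is a simplified illustration and not the code in this PR: `EmbeddingFunctionConfiguration` is replaced by a hypothetical stand-in struct, `embed_strs` is synchronous rather than `async`, the `get_name` signature is assumed, and `LengthEmbedding`, `LengthConfig` and `persist` are made-up names that exist only for the example.

```rust
use std::collections::BTreeMap;
use std::convert::{TryFrom, TryInto};

/// Hypothetical stand-in for the persisted configuration type.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingFunctionConfiguration {
    pub name: String,
    pub params: BTreeMap<String, String>,
}

/// Sketch of a dense EF trait; `embed_strs` is synchronous here for brevity.
pub trait DenseEmbeddingFunction: Sized {
    type Error;
    /// The config must be convertible into the persisted form.
    type Config: TryInto<EmbeddingFunctionConfiguration>;

    fn get_name() -> &'static str;
    fn build_from_config(config: Self::Config) -> Result<Self, Self::Error>;
    fn get_config(&self) -> Result<Self::Config, Self::Error>;
    fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Vec<f32>>, Self::Error>;
}

/// Toy EF (illustration only): every dimension holds the input's byte length.
pub struct LengthEmbedding {
    dims: usize,
}

#[derive(Debug, Clone, PartialEq)]
pub struct LengthConfig {
    pub dims: usize,
}

// Implementing `TryFrom` gives `LengthConfig: TryInto<...>` via the std blanket impl.
impl TryFrom<LengthConfig> for EmbeddingFunctionConfiguration {
    type Error = String;

    fn try_from(config: LengthConfig) -> Result<Self, Self::Error> {
        let mut params = BTreeMap::new();
        params.insert("dims".to_string(), config.dims.to_string());
        Ok(EmbeddingFunctionConfiguration {
            name: LengthEmbedding::get_name().to_string(),
            params,
        })
    }
}

impl DenseEmbeddingFunction for LengthEmbedding {
    type Error = String;
    type Config = LengthConfig;

    fn get_name() -> &'static str {
        "length"
    }

    fn build_from_config(config: LengthConfig) -> Result<Self, String> {
        if config.dims == 0 {
            return Err("dims must be positive".to_string());
        }
        Ok(LengthEmbedding { dims: config.dims })
    }

    fn get_config(&self) -> Result<LengthConfig, String> {
        Ok(LengthConfig { dims: self.dims })
    }

    fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Vec<f32>>, String> {
        Ok(batches
            .iter()
            .map(|s| vec![s.len() as f32; self.dims])
            .collect())
    }
}

/// Manual persistence step: get the typed config, then convert it.
pub fn persist<E: DenseEmbeddingFunction>(ef: &E) -> Option<EmbeddingFunctionConfiguration> {
    let config = ef.get_config().ok()?;
    config.try_into().ok()
}
```

In this sketch the caller still drives persistence by hand (`get_config()` followed by `.try_into()`), which matches the manual flow the description mentions.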

This does not introduce any machinery to automatically persist/hydrate EFs--currently, users must call get_config() / build_from_config() themselves (or .try_into() on the config). We can add that as a follow-up later but it's pretty non-trivial to build:

  • must implement a registry system to map EF names to implementations, allowing third-party crates to register custom EFs
  • need to remove GATs or type erase EFs so we can store refs to generic EFs during hydration
  • signatures of methods like query() will have to change which is a significant breaking change
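
To make the first two follow-up items concrete, here is a minimal sketch of what a name-to-factory registry with type-erased EFs could look like. Nothing here exists in the PR: `EfRegistry`, `ErasedDenseEf`, `ConstantEf`, `constant_factory` and the stand-in `EmbeddingFunctionConfiguration` struct are all hypothetical, and a real design would also have to deal with `async`, sparse EFs and error types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the persisted configuration type.
#[derive(Debug, Clone)]
pub struct EmbeddingFunctionConfiguration {
    pub name: String,
    pub params: HashMap<String, String>,
}

/// Object-safe facade so differently typed EFs can sit behind one pointer type.
pub trait ErasedDenseEf {
    fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Vec<f32>>, String>;
}

type Factory = Box<
    dyn Fn(&EmbeddingFunctionConfiguration) -> Result<Box<dyn ErasedDenseEf>, String>
        + Send
        + Sync,
>;

/// Maps EF names to factories; third-party crates would call `register`.
#[derive(Default)]
pub struct EfRegistry {
    factories: HashMap<String, Factory>,
}

impl EfRegistry {
    pub fn register<F>(&mut self, name: &str, factory: F)
    where
        F: Fn(&EmbeddingFunctionConfiguration) -> Result<Box<dyn ErasedDenseEf>, String>
            + Send
            + Sync
            + 'static,
    {
        self.factories.insert(name.to_string(), Box::new(factory));
    }

    /// Look up the factory by the persisted name and rebuild the EF.
    pub fn hydrate(
        &self,
        config: &EmbeddingFunctionConfiguration,
    ) -> Result<Box<dyn ErasedDenseEf>, String> {
        let factory = self
            .factories
            .get(&config.name)
            .ok_or_else(|| format!("unknown embedding function: {}", config.name))?;
        factory(config)
    }
}

/// Toy EF (illustration only): returns `dims` ones for every input.
pub struct ConstantEf {
    dims: usize,
}

impl ErasedDenseEf for ConstantEf {
    fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Vec<f32>>, String> {
        Ok(batches.iter().map(|_| vec![1.0; self.dims]).collect())
    }
}

/// Factory that parses the persisted parameters for `ConstantEf`.
pub fn constant_factory(
    config: &EmbeddingFunctionConfiguration,
) -> Result<Box<dyn ErasedDenseEf>, String> {
    let dims = config
        .params
        .get("dims")
        .ok_or("missing `dims`")?
        .parse::<usize>()
        .map_err(|e| e.to_string())?;
    Ok(Box::new(ConstantEf { dims }))
}
```

The erased trait is what forces the signature changes mentioned in the last bullet: anything that currently takes a generic EF would need to accept something like `&dyn ErasedDenseEf` instead.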

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@github-actions

github-actions bot commented Nov 4, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)

@codetheweb codetheweb force-pushed the feat-rust-client-ef-config branch 2 times, most recently from 7dd143f to 481b589 Compare November 5, 2025 01:47
}

impl ChromaCollection {
pub(crate) fn new(client: ChromaHttpClient, collection: Collection) -> Self {
Contributor Author

this change ended up not being strictly necessary for this PR but makes things slightly cleaner so I kept it

@codetheweb codetheweb force-pushed the feat-rust-client-ef-config branch from 481b589 to 1842640 Compare November 5, 2025 01:51
@codetheweb codetheweb changed the title [ENH]: (Rust client): add EF config & auto-hydration [ENH]: (Rust client): add EF config Nov 5, 2025
@codetheweb codetheweb force-pushed the feat-rust-client-ef-config branch from 1842640 to e705502 Compare November 5, 2025 02:03
@codetheweb codetheweb marked this pull request as ready for review November 5, 2025 02:03
@codetheweb codetheweb requested a review from rescrv as a code owner November 5, 2025 02:03
@propel-code-bot
Contributor

propel-code-bot bot commented Nov 5, 2025

Embed-Function Config GATs, Dense/Sparse Traits & Collection Helpers

Introduces explicit DenseEmbeddingFunction and SparseEmbeddingFunction traits with a config GAT, enabling round-trip serialise / deserialise of embedding-function parameters. Built-in BM25 and Ollama implementations are updated to comply, including config structs, TryFrom / TryInto<EmbeddingFunctionConfiguration> bridges, and helper builders. Supporting refactors touch collection/client helpers, schema validation and several utility impls (Key: AsRef<str>).

Key Changes

• Split generic EmbeddingFunction into DenseEmbeddingFunction and SparseEmbeddingFunction in rust/chroma/src/embed/mod.rs with new required fns build_from_config, get_config, get_name.
• Added config structs BM25Config and OllamaEmbeddingFunctionConfig + TryFrom<EmbeddingFunctionConfiguration> implementations for persistence.
• Refactored BM25SparseEmbeddingFunction and OllamaEmbeddingFunction to GAT-based config, added error variants for (de)serialisation, simplified encode path (BM25 now infallible).
• Client/collection QoL: centralised ChromaCollection::new, replaced ad-hoc struct literal construction; ChromaHttpClient list/create now use that helper.
• Execution layer: added duplicate impl AsRef<str> for Key (and used in frontend key validation).
• Frontend fix: schema validation now calls key.as_ref() instead of allocating to_string().
• Misc clean-ups: hard-coded timeouts noted, TODOs added, minor doc / test updates.

Affected Areas

• rust/chroma/src/embed/* (traits + implementations)
• rust/chroma/src/client/chroma_http_client.rs
• rust/chroma/src/collection.rs
• rust/types/src/execution/operator.rs
• rust/frontend/src/impls/service_based_frontend.rs

This summary was automatically generated by @propel-code-bot

@codetheweb codetheweb force-pushed the feat-rust-client-ef-config branch from e705502 to 126ac80 Compare November 5, 2025 02:08
token_max_length: usize,
}

impl TryInto<EmbeddingFunctionConfiguration> for BM25Config {
Contributor

I thought TryFrom is always preferred over TryInto. Also, why is it possible to return an error?

Contributor Author

will replace with TryFrom

error is possible because of serialization
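
As a sketch of the `TryFrom` direction agreed on above: implementing `TryFrom<BM25Config>` on the target type also provides `BM25Config: TryInto<...>` through the standard library's blanket impl, so the trait bound keeps working. The code below is illustrative only. `EmbeddingFunctionConfiguration` is a hypothetical stand-in, the `f32` field types and the `"bm25"` name are assumptions, and the finiteness check merely stands in for whatever serialization failure the real conversion can hit.

```rust
use std::collections::BTreeMap;
use std::convert::{TryFrom, TryInto};

/// Hypothetical stand-in for the persisted configuration type.
#[derive(Debug, Clone, PartialEq)]
pub struct EmbeddingFunctionConfiguration {
    pub name: String,
    pub params: BTreeMap<String, String>,
}

/// Field names follow the snippets in this thread; the numeric types are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct BM25Config {
    pub k: f32,
    pub b: f32,
    pub avg_doc_length: f32,
    pub token_max_length: usize,
}

/// Placeholder error standing in for a real serialization error.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    NonFinite(&'static str),
}

impl TryFrom<BM25Config> for EmbeddingFunctionConfiguration {
    type Error = ConfigError;

    fn try_from(config: BM25Config) -> Result<Self, Self::Error> {
        let fields = [
            ("k", config.k),
            ("b", config.b),
            ("avg_doc_length", config.avg_doc_length),
        ];
        let mut params = BTreeMap::new();
        for &(field, value) in fields.iter() {
            // Stand-in for a fallible serialization step.
            if !value.is_finite() {
                return Err(ConfigError::NonFinite(field));
            }
            params.insert(field.to_string(), value.to_string());
        }
        params.insert(
            "token_max_length".to_string(),
            config.token_max_length.to_string(),
        );
        Ok(EmbeddingFunctionConfiguration {
            name: "bm25".to_string(), // illustrative name, not taken from the PR
            params,
        })
    }
}

/// `try_into` is available for free because of the `TryFrom` impl above.
pub fn persist(config: BM25Config) -> Result<EmbeddingFunctionConfiguration, ConfigError> {
    config.try_into()
}
```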

Comment on lines +176 to +182
fn get_config(&self) -> Result<Self::Config, Self::Error> {
    Ok(OllamaEmbeddingFunctionConfig {
        url: self.host.clone(),
        model_name: self.model.clone(),
        timeout: 60,
    })
}
Contributor

[CriticalError]

The timeout is hardcoded here, which breaks the configuration round-trip property. If an OllamaEmbeddingFunction is created with a non-default timeout via build_from_config, this function will return an incorrect timeout value.

To fix this, the timeout should be stored on the OllamaEmbeddingFunction struct. This requires updating the struct definition, new(), and build_from_config() to handle the timeout field consistently.

For example, this function should be updated to use the stored value:

    fn get_config(&self) -> Result<Self::Config, Self::Error> {
        Ok(OllamaEmbeddingFunctionConfig {
            url: self.host.clone(),
            model_name: self.model.clone(),
            timeout: self.timeout, // Use stored value
        })
    }

File: rust/chroma/src/embed/ollama.rs
Line: 182
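
A fuller sketch of the suggested fix, showing the stored field and the round trip together. It is deliberately simplified and is not the code in `ollama.rs`: the methods are inherent rather than trait methods, the error type is a plain `String`, and no HTTP client is constructed. The field names (`host`, `model`, `url`, `model_name`, `timeout`) follow the snippets quoted above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct OllamaEmbeddingFunctionConfig {
    pub url: String,
    pub model_name: String,
    pub timeout: u64,
}

pub struct OllamaEmbeddingFunction {
    host: String,
    model: String,
    // Stored so that `get_config` can report the value the EF was built with.
    timeout: u64,
}

impl OllamaEmbeddingFunction {
    pub fn build_from_config(config: OllamaEmbeddingFunctionConfig) -> Result<Self, String> {
        Ok(Self {
            host: config.url,
            model: config.model_name,
            timeout: config.timeout,
        })
    }

    pub fn get_config(&self) -> Result<OllamaEmbeddingFunctionConfig, String> {
        Ok(OllamaEmbeddingFunctionConfig {
            url: self.host.clone(),
            model_name: self.model.clone(),
            timeout: self.timeout, // stored value instead of a hard-coded 60
        })
    }
}
```

With the field stored, `get_config()` on an EF built from a config with a non-default timeout returns an equal config, which is the round-trip property the comment asks for.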

H: TokenHasher + Send + Sync + 'static,
{
type Embedding = SparseVector;
impl SparseEmbeddingFunction for BM25SparseEmbeddingFunction<Bm25Tokenizer, Murmur3AbsHasher> {
Contributor

why are the generics pinned here?

Comment on lines +153 to +159
Ok(Self {
    tokenizer,
    hasher: Murmur3AbsHasher::default(),
    k: config.k,
    b: config.b,
    avg_len: config.avg_doc_length,
})
Contributor

@Sicheng-Pan Sicheng-Pan Nov 5, 2025

This forces people to use the builtin hasher and tokenizer. Personally I do not like this

Contributor

if we supported custom hashers and tokenizers, how would we build them in the other languages? I think having the default one is fine, no? If they want a custom one they can write their own EF

Contributor

For the hasher and tokenizers we provide in Rust, we need to copy the same logic in other languages. We could have a default EF, but not in this way, because people cannot impl EmbeddingFunction for BM25SparseEmbeddingFunction<?, ?> themselves without wrapping it.
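
One way to read this objection, sketched under assumptions: the config-based constructor could stay generic and only require `Default` on the tokenizer and hasher, so the built-in pair remains the default while custom pairs need no wrapper type. This is an illustration of the trade-off, not what the PR implements. The `Tokenizer` and `TokenHasher` traits below are simplified stand-ins, and `WhitespaceTokenizer`, `LowercaseTokenizer`, `Fnv1aHasher` and `token_ids` are hypothetical names (FNV-1a is swapped in for the real Murmur3-based hasher to keep the example dependency-free).

```rust
/// Simplified stand-ins for the crate's tokenizer and hasher traits.
pub trait Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

pub trait TokenHasher {
    fn hash(&self, token: &str) -> u32;
}

#[derive(Debug, Clone, PartialEq)]
pub struct BM25Config {
    pub k: f32,
    pub b: f32,
    pub avg_doc_length: f32,
}

pub struct BM25SparseEmbeddingFunction<T, H> {
    tokenizer: T,
    hasher: H,
    k: f32,
    b: f32,
    avg_len: f32,
}

// Generic constructor: any tokenizer/hasher pair that is `Default` can be hydrated.
impl<T, H> BM25SparseEmbeddingFunction<T, H>
where
    T: Tokenizer + Default,
    H: TokenHasher + Default,
{
    pub fn build_from_config(config: BM25Config) -> Self {
        Self {
            tokenizer: T::default(),
            hasher: H::default(),
            k: config.k,
            b: config.b,
            avg_len: config.avg_doc_length,
        }
    }
}

impl<T: Tokenizer, H: TokenHasher> BM25SparseEmbeddingFunction<T, H> {
    pub fn get_config(&self) -> BM25Config {
        BM25Config {
            k: self.k,
            b: self.b,
            avg_doc_length: self.avg_len,
        }
    }

    /// Hashed token ids for a document (the BM25 weighting itself is omitted).
    pub fn token_ids(&self, text: &str) -> Vec<u32> {
        self.tokenizer
            .tokenize(text)
            .iter()
            .map(|token| self.hasher.hash(token))
            .collect()
    }
}

/// Hypothetical "built-in" tokenizer.
#[derive(Default)]
pub struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_string).collect()
    }
}

/// Hypothetical user-supplied tokenizer.
#[derive(Default)]
pub struct LowercaseTokenizer;

impl Tokenizer for LowercaseTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(|t| t.to_lowercase()).collect()
    }
}

/// 32-bit FNV-1a, used here only as a dependency-free placeholder hasher.
#[derive(Default)]
pub struct Fnv1aHasher;

impl TokenHasher for Fnv1aHasher {
    fn hash(&self, token: &str) -> u32 {
        let mut h: u32 = 0x811c_9dc5;
        for byte in token.bytes() {
            h ^= u32::from(byte);
            h = h.wrapping_mul(0x0100_0193);
        }
        h
    }
}
```

The cross-language concern raised earlier still applies: a persisted config would need some way to record which tokenizer and hasher were used, which this sketch does not address.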

timeout: u64,
}

impl TryInto<EmbeddingFunctionConfiguration> for OllamaEmbeddingFunctionConfig {
Contributor

^

}

impl AsRef<str> for Key {
fn as_ref(&self) -> &str {
Contributor

nit: we have multiple identical impl for Key -> &str, could be better if we unify them
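
A sketch of the unification this nit points at: keep one canonical `as_str` and have `AsRef<str>` and `Display` delegate to it, so the string mapping lives in a single place. Only `Key::Document => "#document"` is taken from this PR; the other variants and the `is_reserved` helper are assumptions added for illustration.

```rust
use std::fmt;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Key {
    Document,
    Embedding,             // assumed variant, for illustration
    MetadataField(String), // assumed variant, for illustration
}

impl Key {
    /// Single source of truth for the string form of a key.
    pub fn as_str(&self) -> &str {
        match self {
            Key::Document => "#document",
            Key::Embedding => "#embedding",
            Key::MetadataField(name) => name.as_str(),
        }
    }
}

impl AsRef<str> for Key {
    fn as_ref(&self) -> &str {
        self.as_str()
    }
}

impl fmt::Display for Key {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

/// Hypothetical validation helper: borrows the key's string, no allocation.
pub fn is_reserved(key: &Key) -> bool {
    key.as_ref().starts_with('#')
}
```

Callers that only need to inspect the key can borrow through `as_ref()`, while `to_string()` remains available (via `Display`) when an owned `String` is actually required.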

where
Self: Sized;
}

Contributor

i know this isn't necessary at this point for what this PR accomplishes, but can we add a todo for the other trait functions (default space, supported spaces)?

/// # Ok(())
/// # }
/// ```
async fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Vec<f32>>, Self::Error>;
Contributor

another TODO for the embed_query_strs would be helpful for tracking

/// # }
/// ```
async fn embed_strs(&self, batches: &[&str]) -> Result<Vec<Self::Embedding>, Self::Error>;
async fn embed_strs(&self, batches: &[&str]) -> Result<Vec<SparseVector>, Self::Error>;
Contributor

todo for embed_query_strs here as well

impl AsRef<str> for Key {
fn as_ref(&self) -> &str {
match self {
Key::Document => "#document",
Contributor

@jairad26 jairad26 Nov 5, 2025

side note, can we make it so schema supports using Key::Document in source key?

@codetheweb codetheweb force-pushed the feat-rust-client-ef-config branch from 126ac80 to e14aaec Compare November 5, 2025 02:20
@codetheweb codetheweb force-pushed the feat-rust-client-ef-config branch from e14aaec to a3e645f Compare November 5, 2025 02:26
Contributor

@rescrv rescrv left a comment

I'll defer to Sicheng on sparse vector feel. I think this makes sense.

5 participants