
Integrate SparseEncoder model from SentenceTransformers v5 #2873

@arthurbr11

Description


Hello!

Context

We released Sentence Transformers v5 yesterday, introducing a new SparseEncoder model category. It is a subclass of SentenceTransformer that outputs sparse vectors instead of dense embeddings.
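For reference, a minimal usage sketch (the model name is taken from the v5 release examples; exact defaults may differ):

```python
from sentence_transformers import SparseEncoder

# Load a sparse retrieval model (example model from the v5 release notes)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

sentences = ["The weather is lovely today.", "It's so sunny outside!"]

# encode() returns sparse tensors instead of dense embeddings
embeddings = model.encode(sentences)

# The model ships its own similarity function that understands sparse tensors
similarities = model.similarity(embeddings, embeddings)
```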

Feature Description

Add support for SparseEncoder models in MTEB. I tested the current version and found one main blocking issue in the SentenceTransformerWrapper that prevents compatibility:

```python
embeddings = embeddings.cpu().detach().float().numpy()
```

The `.numpy()` call breaks when applied to a sparse tensor, and densifying the embeddings at this point is not desirable anyway.
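One possible direction, sketched here purely as an illustration (convert_embeddings is a hypothetical helper, not existing MTEB code), would be to skip the NumPy conversion when the tensor is sparse:

```python
import torch

def convert_embeddings(embeddings: torch.Tensor):
    # Hypothetical replacement for the unconditional conversion above:
    # only call .numpy() on dense tensors.
    embeddings = embeddings.cpu().detach()
    if embeddings.is_sparse:
        # .numpy() raises on sparse layouts; keep the torch sparse tensor
        # so the model's own similarity function can consume it directly.
        return embeddings
    return embeddings.float().numpy()
```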

Questions

1. Is there a specific reason for the `.numpy()` conversion? Could it be made conditional for sparse models (along the lines of the sketch above)?
2. For similarity computation, sparse models ship an optimized similarity function that should be used instead of MTEB's standard similarity functions. Will that always be possible? If the dense similarity path is used, it will either break or be very slow (see the sketch after this list).
3. Since MTEB has no sparse indexes, encoding in chunks would be the best way to keep the similarity computation reasonable. Is there a plan to handle other index types?
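To make questions 2 and 3 concrete, here is a rough sketch of chunked encoding and scoring; chunked_search and chunk_size are illustrative names, not existing MTEB API:

```python
import torch
from sentence_transformers import SparseEncoder

def chunked_search(model: SparseEncoder, queries: list[str],
                   corpus: list[str], chunk_size: int = 1024) -> torch.Tensor:
    # Illustrative only: encode and score the corpus chunk by chunk so the
    # intermediate similarity matrices stay small, without a sparse index.
    query_embeddings = model.encode(queries)
    score_chunks = []
    for start in range(0, len(corpus), chunk_size):
        chunk_embeddings = model.encode(corpus[start:start + chunk_size])
        # Use the model's sparse-aware similarity rather than MTEB's dense
        # path, which would break on sparse tensors or be very slow.
        score_chunks.append(model.similarity(query_embeddings, chunk_embeddings))
    return torch.cat(score_chunks, dim=1)  # (num_queries, len(corpus))
```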

This would probably be part of a bigger refactor of the Sentence Transformers handling, together with #2871 and other possible features.

Would love to hear your thoughts on this!

cc @tomaarsen

Arthur Bresnu
