Labels: enhancement (New feature or request)
Description
Hello!
Context
We released Sentence Transformers v5 yesterday, introducing a new SparseEncoder model category. These models are a subclass of SentenceTransformer that output sparse vectors instead of dense embeddings.
Feature Description
Add support for SparseEncoder models in MTEB. I tested the current version and found one main blocking issue in the SentenceTransformerWrapper that prevents compatibility:

```python
embeddings = embeddings.cpu().detach().float().numpy()
```

The `.numpy()` call breaks when applied to a sparse tensor (and the conversion isn't wanted there anyway).
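As a sketch of what a conditional conversion could look like (the function name is mine, not MTEB's API; the assumption is that sparse torch tensors should simply be passed through rather than converted to NumPy):

```python
import numpy as np
import torch

def to_numpy_if_dense(embeddings: torch.Tensor):
    """Convert dense tensors to NumPy; pass sparse tensors through.

    Sparse torch tensors don't support .numpy(), so they are returned
    as detached CPU torch tensors instead.
    """
    if embeddings.layout != torch.strided:
        # Sparse layout (COO, CSR, ...): keep it as a torch tensor.
        return embeddings.cpu().detach()
    return embeddings.cpu().detach().float().numpy()

dense_out = to_numpy_if_dense(torch.randn(2, 4))          # NumPy array
sparse_out = to_numpy_if_dense(torch.eye(4).to_sparse())  # stays a torch sparse tensor
```

Checking `layout` instead of `.is_sparse` also covers CSR tensors, for which `.is_sparse` is False.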
Questions
- Is there a specific reason for the `.numpy()` conversion? Could it be made conditional for sparse models?
- For similarity computation, sparse models have an optimized `similarity` function that should be used instead of MTEB's standard similarity functions. Will that always be the case? If the dense way of computing similarity is used, it will either break or be very slow.
- Since no sparse indexes exist in MTEB, encoding in chunks would be the best option to keep the similarity computation reasonable, but is handling other indexes planned?
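To illustrate the chunking idea from the last point, here is a minimal sketch (the function name and chunking scheme are mine, not MTEB's or Sentence Transformers' API): queries are scored chunk by chunk against a sparse corpus matrix so the full score matrix is built incrementally.

```python
import torch

def chunked_sparse_scores(queries: torch.Tensor,
                          sparse_corpus: torch.Tensor,
                          chunk_size: int = 2) -> torch.Tensor:
    """Score dense query chunks against a sparse corpus matrix.

    sparse_corpus: (n_docs, dim) sparse COO tensor.
    queries: (n_queries, dim) dense tensor, processed in chunks so only
    a (n_docs, chunk_size) block of scores is computed at a time.
    """
    blocks = []
    for start in range(0, queries.shape[0], chunk_size):
        chunk = queries[start:start + chunk_size]            # (c, dim)
        # sparse (n_docs, dim) @ dense (dim, c) -> dense (n_docs, c)
        blocks.append(torch.sparse.mm(sparse_corpus, chunk.T).T)
    return torch.cat(blocks, dim=0)                          # (n_queries, n_docs)

corpus = torch.eye(4).to_sparse()   # toy sparse corpus embeddings
queries = torch.randn(3, 4)
scores = chunked_sparse_scores(queries, corpus)
```

With an identity corpus the scores equal the queries themselves, which makes the dot-product scoring easy to check.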
This would probably be part of a bigger refactor of the Sentence Transformers handling, together with #2871 and other possible features.

Would love to hear your thoughts on this.
cc @tomaarsen
Arthur Bresnu