Describe the bug
I am attempting to evaluate the new Qwen3-Embedding models on ruMTEB but have been unable to reproduce the published scores.
I used both the MTEB evaluation code and the evaluation code from the Qwen team.
Since the scores were committed on June 4th but the model revision tied to those scores was only released on June 6th (Hugging Face commit b22da495047858cce924d27d76261e96be6febc0), I also evaluated the previous revision (99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f) for comparison.
Model addition PR: #2769
Scores addition PR: embeddings-benchmark/results#214
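As a side note, both revisions can be pinned locally with huggingface_hub for side-by-side runs; a minimal sketch (the loop and printout are purely illustrative):

```python
# Sketch: download both Hugging Face revisions as pinned local snapshots.
from huggingface_hub import snapshot_download

for rev in (
    "b22da495047858cce924d27d76261e96be6febc0",  # revision tied to the published scores
    "99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f",  # previous revision
):
    path = snapshot_download("Qwen/Qwen3-Embedding-0.6B", revision=rev)
    print(rev, "->", path)
```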
The scores I obtained for Qwen3-Embedding-0.6B:
| dataset | original main score | mteb eval (b22da4) | mteb eval (99cabf) | qwen eval (b22da4) | qwen eval (99cabf) |
|---|---|---|---|---|---|
| TERRa | 0.606803 | 0.561885 | 0.558446 | 0.601651 | 0.565181 |
| AILAStatutes | 0.79018 | 0.72796 | 0.41639 | 0.79309 | 0.5805 |
| STS22 (eng) | 0.711317 | 0.708369 | 0.659296 | 0.708842 | 0.291827 |
I tried different versions of transformers (4.53.4/4.52.4/4.52.1/4.51.3), sentence-transformers (4.1.0/5.0.0), and MTEB (1.38.9/1.38.30/1.38.34); none of them changed the TERRa main_score in the MTEB evaluation.
MTEB uses the model's full context (32k) while the Qwen eval uses only an 8k context, but this accounts for only a ~0.01 difference in scores.
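To isolate the context-length effect, the limit can be capped at 8k on the sentence-transformers side; a minimal sketch (assumes the MTEB loader sits on a plain SentenceTransformer backend):

```python
# Sketch assuming a SentenceTransformer backend; caps the context at 8k
# to mirror the Qwen eval instead of MTEB's full 32k.
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
st_model.max_seq_length = 8192
```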
I also tried FA2 with `model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.float16}` and got no difference.
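For completeness, a sketch of how those kwargs can be passed (assuming they are forwarded to the underlying SentenceTransformer; flash-attn needs to be installed):

```python
import torch
from sentence_transformers import SentenceTransformer

# FA2 + fp16 variant; produced the same scores as the default attention path.
st_model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={
        "attn_implementation": "flash_attention_2",
        "torch_dtype": torch.float16,
    },
)
```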
To reproduce
Code for reproduction:
MTEB:
```python
import os
from functools import partial

import mteb
from mteb.models.qwen3_models import *  # provides ModelMeta, q3e_instruct_loader, multilingual_langs, training_data

# ModelMeta mirroring the registered Qwen3-Embedding-0.6B entry,
# with the revision pinned explicitly.
Qwen3_Embedding_0B6 = ModelMeta(
    loader=partial(
        q3e_instruct_loader,
        model_name_or_path=os.environ.get("Q3E_0B6_PATH", "Qwen/Qwen3-Embedding-0.6B"),
        # revision="99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f",
        revision="b22da495047858cce924d27d76261e96be6febc0",
    ),
    name="Qwen/Qwen3-Embedding-0.6B",
    languages=multilingual_langs,
    open_weights=True,
    # revision="99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f",
    revision="b22da495047858cce924d27d76261e96be6febc0",
    release_date="2025-06-05",
    n_parameters=595776512,
    memory_usage_mb=2272,
    embed_dim=1024,
    max_tokens=32768,  # or 8192 (8k) to match the Qwen eval
    license="apache-2.0",
    reference="https://huggingface.co/Qwen/Qwen3-Embedding-0.6B",
    similarity_fn_name="cosine",
    framework=["Sentence Transformers", "PyTorch"],
    use_instructions=True,
    public_training_code=None,
    public_training_data=None,
    training_datasets=training_data,
)

tasks = mteb.get_tasks(
    tasks=["TERRa", "AILAStatutes", "STS22"],
    # languages=["eng-Latn"],
    # exclusive_language_filter=True,
)

model = Qwen3_Embedding_0B6.load_model()
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, verbosity=2, encode_kwargs={"batch_size": 1})
```
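To compare against the published numbers, the saved result files can be scanned afterwards; a rough sketch assuming MTEB's default results layout and JSON schema:

```python
# Sketch: pull main_score values out of MTEB's result JSONs
# (assumes the default results/<model>/<revision>/<task>.json layout).
import json
import pathlib

for path in pathlib.Path("results").rglob("*.json"):
    data = json.loads(path.read_text())
    if "scores" not in data:
        continue  # skip model_meta.json and other non-result files
    for split, entries in data["scores"].items():
        for entry in entries:
            print(data.get("task_name"), split, entry.get("main_score"))
```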
Qwen:
(change the revision/tasks as needed)

```bash
python run_mteb.py --model Qwen/Qwen3-Embedding-0.6B --model_name Qwen/Qwen3-Embedding-0.6B --precision fp16 --model_kwargs "{\"max_length\": 8192, \"attn_type\": \"causal\", \"pooler_type\": \"last\", \"do_norm\": true, \"use_instruction\": true, \"instruction_template\": \"Instruct: {}\nQuery:\", \"instruction_dict_path\": \"task_prompts.json\", \"attn_implementation\":\"flash_attention_2\", \"revision\":\"99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f\"}" --run_kwargs "{\"save_predictions\": \"true\"}" --batch_size 1 --tasks "STS22"
```
Additional information
No response
Are you interested to contribute a fix for this bug?
No