
Can not reproduce Qwen3-Embedding results #2907

@dzaripov

Description


Describe the bug

I am attempting to evaluate the new Qwen3-Embedding models on ruMTEB but have been unable to reproduce the published scores.

I used both the MTEB evaluation code and the evaluation code from the Qwen team.

As the scores were committed on 4 June but the model revision tied to those scores was released on 6 June (Hugging Face commit b22da495047858cce924d27d76261e96be6febc0), I also evaluated the previous revision (99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f) for comparison.

Model addition PR: #2769
Scores addition PR: embeddings-benchmark/results#214

The scores I obtained for Qwen3-Embedding-0.6B:

| dataset | original main score | mteb eval (b22da4) | mteb eval (99cabf) | qwen eval (b22da4) | qwen eval (99cabf) |
|---|---|---|---|---|---|
| TERRa | 0.606803 | 0.561885 | 0.558446 | 0.601651 | 0.565181 |
| AILAStatutes | 0.79018 | 0.72796 | 0.41639 | 0.79309 | 0.5805 |
| STS22 (eng) | 0.711317 | 0.708369 | 0.659296 | 0.708842 | 0.291827 |
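For reference, a quick way to quantify the gaps between the published scores and my MTEB reproduction (numbers taken from the table above, mteb eval at revision b22da4):

```python
# Published main scores vs. my reproduced scores (mteb eval, revision b22da4),
# copied from the table above.
published = {"TERRa": 0.606803, "AILAStatutes": 0.79018, "STS22 (eng)": 0.711317}
reproduced = {"TERRa": 0.561885, "AILAStatutes": 0.72796, "STS22 (eng)": 0.708369}

deltas = {task: round(published[task] - reproduced[task], 6) for task in published}
print(deltas)
# TERRa differs by ~0.045, AILAStatutes by ~0.062, STS22 (eng) by only ~0.003
```

So the retrieval and pair-classification gaps are an order of magnitude larger than what context-length or precision differences would explain.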

I tried different versions of transformers (4.53.4/4.52.4/4.52.1/4.51.3), sentence-transformers (4.1.0/5.0.0), and MTEB (1.38.9/1.38.30/1.38.34); none of them changed the main_score on the TERRa task under the MTEB evaluation.
In MTEB the model uses its full context (32k), while the Qwen eval uses only an 8k context; that difference alone accounts for roughly a 0.01 gap in scores.
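To make the 32k-vs-8k point concrete, here is a minimal sketch (a hypothetical `truncate_ids` helper, not code from either harness) of what a shorter `max_length` cap does to a long document's token ids:

```python
def truncate_ids(input_ids, max_length):
    """Keep only the first max_length token ids, as a max_length cap would.

    With last-token pooling (as Qwen3-Embedding uses), the pooled hidden
    state comes from a different token whenever the document exceeds the
    shorter cap, so embeddings for long documents diverge between harnesses.
    """
    return input_ids[:max_length]

doc = list(range(10_000))  # a document of 10,000 token ids

print(len(truncate_ids(doc, 8192)))   # Qwen eval cap: document is cut
print(len(truncate_ids(doc, 32768)))  # MTEB cap: document passes through intact
```

Only documents longer than 8192 tokens are affected, which is consistent with the gap being small (~0.01) on these tasks.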

I also tried FlashAttention-2 with `model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.float16}` and got no difference.

To reproduce

Code for reproduction:

MTEB:

```python
import os
from functools import partial

import mteb
# wildcard import provides ModelMeta, q3e_instruct_loader,
# multilingual_langs, and training_data
from mteb.models.qwen3_models import *

Qwen3_Embedding_0B6 = ModelMeta(
    loader=partial(
        q3e_instruct_loader,
        model_name_or_path=os.environ.get("Q3E_0B6_PATH", "Qwen/Qwen3-Embedding-0.6B"),
        # revision="99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f",
        revision="b22da495047858cce924d27d76261e96be6febc0",
    ),
    name="Qwen/Qwen3-Embedding-0.6B",
    languages=multilingual_langs,
    open_weights=True,
    # revision="99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f",
    revision="b22da495047858cce924d27d76261e96be6febc0",
    release_date="2025-06-05",
    n_parameters=595776512,
    memory_usage_mb=2272,
    embed_dim=1024,
    max_tokens=32768,  # or 8k - 8192
    license="apache-2.0",
    reference="https://huggingface.co/Qwen/Qwen3-Embedding-0.6B",
    similarity_fn_name="cosine",
    framework=["Sentence Transformers", "PyTorch"],
    use_instructions=True,
    public_training_code=None,
    public_training_data=None,
    training_datasets=training_data,
)

tasks = mteb.get_tasks(
    tasks=[
        "TERRa",
        "AILAStatutes",
        "STS22",
    ],
    # languages=["eng-Latn"],
    # exclusive_language_filter=True,
)
model = Qwen3_Embedding_0B6.load_model()

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, verbosity=2, encode_kwargs={"batch_size": 1})
```

Qwen:

Change the revision/tasks as needed:

```shell
python run_mteb.py \
  --model Qwen/Qwen3-Embedding-0.6B \
  --model_name Qwen/Qwen3-Embedding-0.6B \
  --precision fp16 \
  --model_kwargs "{\"max_length\": 8192, \"attn_type\": \"causal\", \"pooler_type\": \"last\", \"do_norm\": true, \"use_instruction\": true, \"instruction_template\": \"Instruct: {}\nQuery:\", \"instruction_dict_path\": \"task_prompts.json\", \"attn_implementation\": \"flash_attention_2\", \"revision\": \"99cabfa1346cbf4ac8b0e73079bb2e286cff3a1f\"}" \
  --run_kwargs "{\"save_predictions\": \"true\"}" \
  --batch_size 1 \
  --tasks "STS22"
```

Additional information

No response

Are you interested to contribute a fix for this bug?

No
