Qwen2.5-Coder-3B-Instruct only achieves 45.12 pass@1 on HumanEval using OpenCompass #420

Qwen2.5-Coder-3B-Instruct achieves a pass@1 score of only 45.12 on HumanEval and 30.20 on MBPP when evaluated with the well-known benchmark platform [OpenCompass](https://github.com/open-compass/opencompass). These scores are significantly lower than the 84.1 and 73.6 reported in the Qwen2.5-Coder technical report (Table 16).
The logs are attached.

Could you clarify the possible reasons for this discrepancy?

[run-humaneval-0.0-Qwen2.5-Coder-3B-Instruct.txt](https://github.com/user-attachments/files/20815525/run-humaneval-0.0-Qwen2.5-Coder-3B-Instruct.txt)
[run-mbpp-0.0-Qwen2.5-Coder-3B-Instruct.txt](https://github.com/user-attachments/files/20815524/run-mbpp-0.0-Qwen2.5-Coder-3B-Instruct.txt)

#### Commands:

nohup python -u run.py --datasets humaneval_passk_gen_8e312c --hf-type base --hf-path ../models/Qwen2.5-Coder-3B-Instruct --max-out-len 512 --generation-kwargs do_sample=False temperature=0.0 --debug > run-humaneval-0.0-Qwen2.5-Coder-3B-Instruct.out 2>&1 &

nohup python -u run.py --datasets mbpp_passk_gen_830460 --hf-type base --hf-path ../models/Qwen2.5-Coder-3B-Instruct --max-out-len 512 --generation-kwargs do_sample=False temperature=0.0 --debug > run-mbpp-0.0-Qwen2.5-Coder-3B-Instruct.out 2>&1 &
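For reference, this is how I am reading the metric. The sketch below uses the standard unbiased pass@k estimator; with greedy decoding there is one sample per problem, so pass@1 reduces to the fraction of problems solved. The function name `pass_at_k` is only illustrative and is not taken from OpenCompass. The last line rests on an assumption: if the reported score is a plain fraction over the 164 HumanEval problems, 45.12 would correspond to 74 solved.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Greedy decoding: one sample per problem, so pass@1 is either 0 or 1.
print(pass_at_k(n=1, c=1, k=1))  # solved problem
print(pass_at_k(n=1, c=0, k=1))  # failed problem

# Assumption: the benchmark score is the mean over all problems, in percent.
print(round(100 * 74 / 164, 2))
```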
