Description
Qwen2.5-Coder-3B-Instruct achieves pass@1 scores of only 45.12 on HumanEval and 30.20 on MBPP when evaluated with OpenCompass, significantly lower than the 84.1 and 73.6 reported in the Qwen2.5-Coder technical report (Table 16).
Could you clarify the possible reasons for this discrepancy?
The logs are attached:
run-humaneval-0.0-Qwen2.5-Coder-3B-Instruct.txt
run-mbpp-0.0-Qwen2.5-Coder-3B-Instruct.txt
Commands:
nohup python -u run.py --datasets humaneval_passk_gen_8e312c --hf-type base --hf-path ../models/Qwen2.5-Coder-3B-Instruct --max-out-len 512 --generation-kwargs do_sample=False temperature=0.0 --debug > run-humaneval-0.0-Qwen2.5-Coder-3B-Instruct.out 2>&1 &
nohup python -u run.py --datasets mbpp_passk_gen_830460 --hf-type base --hf-path ../models/Qwen2.5-Coder-3B-Instruct --max-out-len 512 --generation-kwargs do_sample=False temperature=0.0 --debug > run-mbpp-0.0-Qwen2.5-Coder-3B-Instruct.out 2>&1 &
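To put the size of the gap in concrete terms, here is a small sketch that converts the pass@1 percentages above into approximate solved-problem counts. With greedy decoding (one sample per problem), pass@1 is simply the fraction of problems whose single completion passes the tests. The problem counts are assumptions that are not taken from the logs: 164 problems for HumanEval and 500 for the MBPP test split. The names `NUM_PROBLEMS`, `SCORES`, and `solved_count` are made up for illustration.

```python
# Assumed benchmark sizes (not read from the attached logs).
NUM_PROBLEMS = {"humaneval": 164, "mbpp": 500}

# pass@1 percentages: measured in this issue vs. reported in the tech report.
SCORES = {
    "humaneval": {"measured": 45.12, "reported": 84.1},
    "mbpp": {"measured": 30.20, "reported": 73.6},
}


def solved_count(score_pct: float, num_problems: int) -> int:
    """Approximate number of problems solved for a given pass@1 percentage."""
    return round(score_pct / 100 * num_problems)


for name, n in NUM_PROBLEMS.items():
    measured = solved_count(SCORES[name]["measured"], n)
    reported = solved_count(SCORES[name]["reported"], n)
    print(f"{name}: measured ~{measured}/{n}, reported ~{reported}/{n}, "
          f"gap ~{reported - measured} problems")
```

Under these assumptions, the measured scores correspond to roughly 74 of 164 HumanEval problems and 151 of 500 MBPP problems, against roughly 138 and 368 for the reported scores. A gap of that size looks more like a systematic difference in the evaluation setup than run-to-run noise.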