Description
Hi, thank you for your excellent work!
I'm currently evaluating models on the RoboVQA benchmark and noticed that your code (e.g., judge_robovqa) implements a customized BLEU evaluation scheme (sketched below) that:
- Applies jieba tokenization to Chinese answers,
- Takes the maximum score over truncations of the prediction (a ±5 window),
- Computes BLEU-1 to BLEU-4 separately using sentence_bleu.
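
For concreteness, here is a minimal sketch of how I currently read that scheme. The helper names (`contains_chinese`, `truncated_max_bleu`) are mine, not from your code, and I'm assuming the ±5 window is taken around the reference length and that `sentence_bleu` refers to NLTK's implementation; please correct me if I've misread any of it:

```python
# Rough sketch of my reading of the customized scheme (not your actual code).
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def contains_chinese(text: str) -> bool:
    """Heuristic CJK check (my assumption for when jieba is applied)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def tokenize(text: str) -> list[str]:
    # Chinese answers go through jieba; otherwise a plain whitespace split.
    return list(jieba.cut(text)) if contains_chinese(text) else text.split()


def truncated_max_bleu(pred: str, ref: str) -> dict[str, float]:
    """Max BLEU-1..4 over prediction prefixes of length len(ref) ± 5 (assumed)."""
    pred_toks, ref_toks = tokenize(pred), tokenize(ref)
    smooth = SmoothingFunction().method1
    weights = {
        "bleu1": (1, 0, 0, 0),
        "bleu2": (0.5, 0.5, 0, 0),
        "bleu3": (1 / 3, 1 / 3, 1 / 3, 0),
        "bleu4": (0.25, 0.25, 0.25, 0.25),
    }
    scores = {name: 0.0 for name in weights}
    lo = max(1, len(ref_toks) - 5)
    hi = min(len(pred_toks), len(ref_toks) + 5)
    for cut in range(lo, hi + 1):
        window = pred_toks[:cut]
        for name, w in weights.items():
            s = sentence_bleu([ref_toks], window, weights=w,
                              smoothing_function=smooth)
            scores[name] = max(scores[name], s)
    return scores
```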
Meanwhile, the official RoboVQA evaluation protocol (e.g., robovqa_aggregate_results) appears to use the standard sacrebleu.sentence_bleu(pred, [ref]) without these additional steps.
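
For comparison, my understanding of that official path is essentially a single call like the following (the `pred`/`ref` naming is mine):

```python
# My understanding of the standard protocol: one sacrebleu call,
# with no custom tokenization, truncation window, or per-n BLEU split.
import sacrebleu


def official_bleu(pred: str, ref: str) -> float:
    # sacrebleu.sentence_bleu returns a BLEUScore object; .score is on a 0-100 scale.
    return sacrebleu.sentence_bleu(pred, [ref]).score
```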
I’m curious about the motivation behind adopting this alternative evaluation strategy. Do you consider the official metric too strict or less suitable for certain types of answers? Or do you find that your version better captures the intended model behavior?
I'd really appreciate any clarification on this, as I’m trying to ensure my evaluation results are consistent and fairly comparable. Thank you again for your contributions!