Description
Hi, thank you for your excellent work!
I'm currently evaluating models on the RoboVQA benchmark and noticed that your code (e.g., judge_robovqa) implements a customized BLEU evaluation scheme (sketched below) that:
- Applies jieba tokenization to Chinese answers,
- Takes the maximum score over truncations of the prediction (a ±5 window),
- Computes BLEU-1 to BLEU-4 separately using sentence_bleu.
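
For concreteness, here is a minimal sketch of how I currently read that scheme. The helper names (`contains_chinese`, `truncated_max_bleu`) are mine, not from your code, and I'm assuming the ±5 window is taken around the reference length and that `sentence_bleu` refers to NLTK's implementation; please correct me if I've misread any of it:

```python
# Rough sketch of my reading of the customized scheme (not your actual code).
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def contains_chinese(text: str) -> bool:
    """Heuristic CJK check (my assumption for when jieba is applied)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def tokenize(text: str) -> list[str]:
    # Chinese answers go through jieba; otherwise a plain whitespace split.
    return list(jieba.cut(text)) if contains_chinese(text) else text.split()


def truncated_max_bleu(pred: str, ref: str) -> dict[str, float]:
    """Max BLEU-1..4 over prediction prefixes of length len(ref) ± 5 (assumed)."""
    pred_toks, ref_toks = tokenize(pred), tokenize(ref)
    smooth = SmoothingFunction().method1
    weights = {
        "bleu1": (1, 0, 0, 0),
        "bleu2": (0.5, 0.5, 0, 0),
        "bleu3": (1 / 3, 1 / 3, 1 / 3, 0),
        "bleu4": (0.25, 0.25, 0.25, 0.25),
    }
    scores = {name: 0.0 for name in weights}
    lo = max(1, len(ref_toks) - 5)
    hi = min(len(pred_toks), len(ref_toks) + 5)
    for cut in range(lo, hi + 1):
        window = pred_toks[:cut]
        for name, w in weights.items():
            s = sentence_bleu([ref_toks], window, weights=w,
                              smoothing_function=smooth)
            scores[name] = max(scores[name], s)
    return scores
```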
Meanwhile, the official RoboVQA evaluation protocol (e.g., robovqa_aggregate_results) appears to use the standard sacrebleu.sentence_bleu(pred, [ref]) without these additional steps.
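
For comparison, my understanding of that official path is essentially a single call like the following (the `pred`/`ref` naming is mine):

```python
# My understanding of the standard protocol: one sacrebleu call,
# with no custom tokenization, truncation window, or per-n BLEU split.
import sacrebleu


def official_bleu(pred: str, ref: str) -> float:
    # sacrebleu.sentence_bleu returns a BLEUScore object; .score is on a 0-100 scale.
    return sacrebleu.sentence_bleu(pred, [ref]).score
```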
I’m curious about the motivation behind adopting this alternative evaluation strategy. Do you consider the official metric too strict or less suitable for certain types of answers? Or do you find that your version better captures the intended model behavior?
I'd really appreciate any clarification on this, as I’m trying to ensure my evaluation results are consistent and fairly comparable. Thank you again for your contributions!