
RoboVQA Evaluation #27

Open

@LongXinKou

Hi, thank you for your excellent work!

I'm currently evaluating models on the RoboVQA benchmark and noticed that in your code (e.g., judge_robovqa), you implemented a customized BLEU evaluation scheme (sketched below) that:

  • Applies jieba tokenization to Chinese answers,
  • Performs max-over-truncated-prediction (±5 window),
  • Computes BLEU-1 to BLEU-4 separately using sentence_bleu.
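
For concreteness, here is a minimal sketch of how I'm currently reading that scheme. The function and variable names are mine, not taken from judge_robovqa, and I'm assuming the ±5 window is taken around the reference length; please correct me if I've misread it:

```python
# Sketch of my reading of the customized scheme; names are illustrative only.
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def custom_bleu(pred: str, ref: str) -> dict:
    ref_tokens = list(jieba.cut(ref))    # jieba tokenization (for Chinese answers)
    pred_tokens = list(jieba.cut(pred))
    smooth = SmoothingFunction().method1
    scores = {}
    for n in range(1, 5):                # BLEU-1 .. BLEU-4 computed separately
        weights = tuple(1.0 / n for _ in range(n))
        best = 0.0
        # Assumption: max over the prediction truncated to len(ref) +/- 5 tokens.
        for length in range(max(1, len(ref_tokens) - 5), len(ref_tokens) + 6):
            truncated = pred_tokens[:length]
            score = sentence_bleu([ref_tokens], truncated,
                                  weights=weights, smoothing_function=smooth)
            best = max(best, score)
        scores[f"bleu-{n}"] = best
    return scores
```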

Meanwhile, the official RoboVQA evaluation protocol (e.g., robovqa_aggregate_results) seems to use the standard sacrebleu.sentence_bleu(pred, [ref]) without these additional steps.
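
For comparison, this is the plain library call I'm referring to (sacrebleu's sentence_bleu returns a BLEUScore whose .score is on a 0-100 scale), with no extra tokenization or truncation on top:

```python
# Standard sacrebleu sentence-level BLEU, as I read robovqa_aggregate_results.
import sacrebleu


def official_bleu(pred: str, ref: str) -> float:
    return sacrebleu.sentence_bleu(pred, [ref]).score
```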

I’m curious about the motivation behind adopting this alternative evaluation strategy. Do you consider the official metric too strict or less suitable for certain types of answers? Or do you find that your version better captures the intended model behavior?

I'd really appreciate any clarification on this, as I’m trying to ensure my evaluation results are consistent and fairly comparable. Thank you again for your contributions!
