Proposal to Evaluate on the HumanEval-V Benchmark for Enhanced Visual Reasoning and Code Generation #25

@zfj1998

Description

Congratulations on the impressive work!

I would like to suggest expanding the evaluation of visual reasoning to the HumanEval-V benchmark. This benchmark provides a more challenging set of tasks by introducing complex diagrams paired with coding challenges. Unlike traditional visual reasoning tasks that focus on answering multiple-choice questions or providing short answers, HumanEval-V requires models to generate code based on visual input, which better tests both instruction-following and open-ended generation abilities.

Key points for consideration:

  • HumanEval-V expands the reasoning scenarios with complex diagrams, pushing the limits of visual understanding.
  • The task format is tailored to code generation, making it a suitable benchmark for testing MLLMs’ ability to handle more structured, generative tasks.
  • Evaluating on this benchmark will provide valuable insight into how well a model handles visual reasoning combined with coding, and the generated code can be scored and rewarded through execution feedback (see the sketch after this list).
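
For concreteness, here is a minimal sketch of what execution-feedback scoring could look like for HumanEval-V-style tasks. The task schema (`image`, `prompt`, `tests`) and the `generate_solution` helper are hypothetical placeholders, not the benchmark's actual API; the idea is simply to run the model's generated code against the task's unit tests and reward the fraction that passes.

```python
import subprocess
import sys
import tempfile

def run_with_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Execute model-generated code together with the task's unit tests;
    return True if every test passes within the timeout."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def pass_at_1(tasks, generate_solution) -> float:
    """Fraction of tasks whose generated solution passes the provided tests;
    this pass rate can serve as a simple execution-feedback reward signal.
    `tasks` and `generate_solution` are hypothetical stand-ins for the
    benchmark loader and the MLLM's code-generation call."""
    passed = sum(
        run_with_tests(generate_solution(task["image"], task["prompt"]), task["tests"])
        for task in tasks
    )
    return passed / len(tasks)
```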

You can find more information about the benchmark here: HumanEval-V Homepage.
