Proposal to Evaluate on the HumanEval-V Benchmark for Enhanced Visual Reasoning and Code Generation #25

@zfj1998

Description

Congratulations on the impressive work!

I would like to suggest expanding the evaluation of visual reasoning to the HumanEval-V benchmark. This benchmark provides a more challenging set of tasks by introducing complex diagrams paired with coding challenges. Unlike traditional visual reasoning tasks that focus on answering multiple-choice questions or providing short answers, HumanEval-V requires models to generate code based on visual input, which better tests both instruction-following and open-ended generation abilities.

Key points for consideration:

  • HumanEval-V expands the reasoning scenarios with complex diagrams, pushing the limits of visual understanding.
  • The task format is tailored to code generation, making it a suitable benchmark for testing MLLMs’ ability to handle more structured, generative tasks.
  • Evaluating on this benchmark will provide valuable insight into how well a model handles visual reasoning combined with coding, and the generated code can be scored and rewarded through execution feedback (see the sketch after this list).
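
For concreteness, here is a minimal sketch of what execution-feedback scoring could look like for HumanEval-V-style tasks. The task schema (`image`, `prompt`, `tests`) and the `generate_solution` helper are hypothetical placeholders, not the benchmark's actual API; the idea is simply to run the model's generated code against the task's unit tests and reward the fraction that passes.

```python
import subprocess
import sys
import tempfile

def run_with_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Execute model-generated code together with the task's unit tests;
    return True if every test passes within the timeout."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def pass_at_1(tasks, generate_solution) -> float:
    """Fraction of tasks whose generated solution passes the provided tests;
    this pass rate can serve as a simple execution-feedback reward signal.
    `tasks` and `generate_solution` are hypothetical stand-ins for the
    benchmark loader and the MLLM's code-generation call."""
    passed = sum(
        run_with_tests(generate_solution(task["image"], task["prompt"]), task["tests"])
        for task in tasks
    )
    return passed / len(tasks)
```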

You can find more information about the benchmark here: HumanEval-V Homepage.
