🌟 This is the official repository for the paper "MV-MATH: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts", which contains both evaluation code and data for the MV-MATH benchmark.
[🌐 Homepage] [🤗 Huggingface Dataset] [📊 Leaderboard ] [🔍 Visualization] [📖 ArXiv Paper]
- [2025-07-21] Qwen-VL-Max achieves a strong 42.4%, and Seed1.5-VL (thinking) achieves a stunning 72.9% on MV-MATH, setting a new SOTA. 🎉 Congratulations!
- [2025-03-01] See this page for the homepage of MV-MATH.
- [2025-03-01] The o1-like model QVQ-72B-Preview achieves 29.3%, establishing itself as the new best-performing open-source model. 🎉 Congratulations!
- [2025-02-27] Our dataset is now accessible on Hugging Face.
- [2025-02-27] The top-performing model, Claude-3.5-Sonnet, scores only 33.9% on MV-MATH, while human performance is around 76%.
- [2025-02-27] MV-MATH is accepted to CVPR 2025! 🎉
MV-MATH is a meticulously annotated dataset designed to evaluate the mathematical reasoning capabilities of MLLMs in multi-visual contexts. Each sample in MV-MATH consists of interleaved text and multiple images. The benchmark comprises 2,009 multi-image questions, with some questions containing up to 8 images, and includes three question types: multiple-choice, free-form, and multi-step.
MV-MATH is organized into 11 subjects over 3 difficulty levels, including Analytic Geometry, Algebra, Metric Geometry, Combinatorics, Transformation Geometry, Logic, Solid Geometry, Arithmetic, Combinatorial Geometry, Descriptive Geometry and Statistics, covering a range of scenarios from the K-12 mathematics curriculum.
Based on image relevance, we categorize MV-MATH into two subsets: a mutually dependent set (MD), where images are interrelated and understanding one image necessitates information from another; and an independent set (ID), where images are unrelated and can be interpreted independently without reference to other images.
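To get a feel for the data, below is a minimal sketch of loading the dataset with the Hugging Face `datasets` library. The repository ID, split name, and field names used here are assumptions for illustration only; check the dataset card for the actual schema.

```python
# Minimal sketch: load MV-MATH from Hugging Face and inspect one sample.
# The repo ID, split, and field names below are assumptions; see the
# dataset card for the actual schema.
from datasets import load_dataset

dataset = load_dataset("PeijieWang/MV-MATH", split="test")  # assumed repo ID / split

sample = dataset[0]
print(sample["question"])        # interleaved text with image placeholders (assumed field)
print(sample["question_type"])   # multiple-choice / free-form / multi-step (assumed field)
print(len(sample["images"]))     # up to 8 images per question (assumed field)
```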
The accuracies of six prominent Multimodal Large Language Models (MLLMs) are evaluated on our proposed MV-MATH across 11 subjects.
Through extensive experimentation, we unveil a notable gap between current MLLMs and human performance on MV-MATH, underscoring the need for further advances in MLLMs.
You can refer to our project homepage and the paper for more details.
Some examples of MV-MATH on three subjects: analytic geometry, topology, and graph theory.
You can refer to Appendix A.4 of the paper for example images of all 11 subjects.
The leaderboard is available here.
```bash
python models/API_model.py
```
This will run the GPT-4o / Claude-3.5-Sonnet / Gemini-1.5-Pro / GPT-4V API and save the outputs to the `./API_name.jsonl` path. You can modify the system prompt, max tokens, etc.
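For reference, here is a hypothetical, self-contained sketch of what such an API runner does for GPT-4o: it base64-encodes the question's images, sends the interleaved prompt, and appends the reply to a `.jsonl` file. The prompt format, file paths, and output schema shown are assumptions, not the repository's exact implementation.

```python
# Hypothetical sketch of an API runner: send one interleaved multi-image
# question to GPT-4o and append the reply to a .jsonl file.
# File paths, the placeholder tokens, and the output schema are assumptions.
import base64, json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask(question: str, image_paths: list[str]) -> str:
    content = [{"type": "text", "text": question}]
    for p in image_paths:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encode_image(p)}"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

# Example usage with assumed file names and placeholder tokens.
answer = ask("Based on <image1> and <image2>, find the value of x.",
             ["images/q1_1.png", "images/q1_2.png"])
with open("GPT-4o.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"response": answer}, ensure_ascii=False) + "\n")
```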
Generate image captions using Claude-3.5-Sonnet:
```bash
python models/Caption_Claude.py
```
Then you can use the generated merged data and images for inference.
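As an illustration of caption-based inference, the following sketch merges generated captions back into the question text so a model can be queried without images. The `<image_k>` placeholder token and the caption-file format are assumptions; adapt them to the actual merged data produced by `Caption_Claude.py`.

```python
# Illustrative sketch: splice generated captions back into the question text.
# The "<image_k>" placeholder and the caption-file format are assumptions.
import json

def merge_captions(question: str, captions: list[str]) -> str:
    """Replace each image placeholder with its generated caption."""
    for i, cap in enumerate(captions, start=1):
        question = question.replace(f"<image_{i}>", f"[Image {i} caption: {cap}]")
    return question

with open("captions_claude.jsonl", encoding="utf-8") as f:   # assumed caption output file
    records = [json.loads(line) for line in f]

for r in records:
    text_only_prompt = merge_captions(r["question"], r["captions"])
    # ... feed text_only_prompt to the model under evaluation
```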
Once all the model outputs have been generated, run the evaluation scripts
```bash
python evaluation/evaluate_choice.py
python evaluation/evaluate_freeform.py
```
to assess these outputs, and then run
```bash
python evaluation/merge_score.py
```
This script will examine all outputs located in the `outputs/` directory and compute the overall accuracy.
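For intuition, here is a simplified, hypothetical scorer for multiple-choice outputs that mirrors what this evaluation step computes: it extracts an option letter from each response and reports accuracy over a `.jsonl` file. The field names are assumptions; the repository's scripts additionally handle free-form and multi-step answers.

```python
# Simplified, illustrative multiple-choice scorer.
# Field names ("response", "answer") and the output path are assumptions.
import json
import re

def extract_choice(response: str) -> str | None:
    """Return the first standalone option letter A-E found in the response."""
    m = re.search(r"\b([A-E])\b", response)
    return m.group(1) if m else None

def choice_accuracy(jsonl_path: str) -> float:
    correct = total = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            if extract_choice(rec["response"]) == rec["answer"]:
                correct += 1
    return correct / max(total, 1)

print(f"choice accuracy: {choice_accuracy('outputs/GPT-4o.jsonl'):.3f}")
```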
You can refer to Appendix H of the paper for evaluation results of the above models and case studies.
If you find this benchmark useful in your research, please consider citing it with the following BibTeX entry:
@inproceedings{wang2025mv,
title={{MV-MATH}: Evaluating Multimodal Math Reasoning in Multi-Visual Contexts},
author={Wang, Peijie and Li, Zhong-Zhi and Yin, Fei and Ran, Dekang and Liu, Cheng-Lin},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={19541--19551},
year={2025}
}
- [Survey🔥🔥] From System 1 to System 2: A Survey of Reasoning Large Language Models
- [CMMaTH🔥🔥] CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
- [GeoEval🔥🔥] GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving
- [Math-Vision🔥] Measuring Multimodal Mathematical Reasoning with the MATH-Vision Dataset
- [MathVerse🔥] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [MathVista🔥] MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts