M5 - A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
This repository contains the code for the M5 Benchmark paper.
- 2024-11: 🌴 We happily presented the M5 Benchmark at EMNLP in Miami! The paper is now available in the ACL Anthology!
- 2024-10: 👷 We are currently working on M5 Benchmark v2 with a cleaner and more extendable codebase, support for more models, and additional datasets.
- 2024-10: 🤗 We released the datasets used in the M5 Benchmark on HuggingFace.
- 2024-10: ⭐️ We publicly released the code for the M5 Benchmark.
- 2024-09: ⭐️ Our paper got accepted at EMNLP (Findings) 2024, Miami, FL, USA.
- 2024-07: 📝 We released the first preprint of the M5 Benchmark paper.
Note that all code was tested only on Debian-based systems (Ubuntu 20.04 and 22.04) with CUDA 11.8 pre-installed.
To set up the environments, run the `setup.sh` script:

```bash
./setup.sh
```

This takes a long time (!) and will:
- Install the `mamba` package manager if not already installed.
- Create the following `mamba` environments:
  - `m5b` for the majority of the models
  - `m5b-cogvlm` for the CogVLM models
  - `m5b-yivl` for the Yi-VL models. Note that this also downloads the model weights.
  - `m5b-llava` for the original LLaVA models from Haotian Liu
  - `m5b-omnilmm` for the OmniLMM models
  - `m5b-qwenvl` for the QwenVL models
 
- Test each environment to ensure that it is correctly set up.
You can also install individual environments by running the following command:

```bash
./setup.sh mamba-env-create <env_name>
```

where `<env_name>` is one of the environments listed above.
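For example, to set up and activate only the environment for the original LLaVA models, a minimal sketch (the environment name is taken from the list above; the activation step assumes `mamba` has been initialized in your shell):

```bash
# Create only the m5b-llava environment
./setup.sh mamba-env-create m5b-llava

# Activate it before running any LLaVA evaluation
mamba activate m5b-llava
```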
If the installation was successful, you should see a message similar to the following for each environment:
```
##################### PYTHON ENV 'm5b-llava' INFO START #####################
Python version: 3.10.15 | packaged by conda-forge | (main, Jun 16 2024, 01:24:24) [GCC 13.3.0]
PyTorch version: 2.1.2
CUDA available: True
CUDA version: 11.8
CUDA devices: 5
Flash Attention 2 Support: True
Transformers version: 4.37.2
Datasets version: 3.0.1
Lightning version: 2.4.0
##################### PYTHON ENV 'm5b-llava' INFO END #####################
```

To speed up the evaluation process, the datasets of the M5 Benchmark are stored as WebDataset archives. You can download all datasets as a single 46 GB tar archive using the following command:
```bash
curl https://ltdata1.informatik.uni-hamburg.de/m5b/m5b_datasets.tar -o /path/to/the/m5b/datasets/m5b_datasets.tar \
  && tar -xvf /path/to/the/m5b/datasets/m5b_datasets.tar
```

Because different models require different environments, you have to activate the correct environment before running the benchmark.
| Model ID | Environment | 
|---|---|
| 🤗/openbmb/MiniCPM-V | m5b | 
| 🤗/Gregor/mblip-mt0-xl | m5b | 
| 🤗/Gregor/mblip-bloomz-7b | m5b | 
| 🤗/llava-hf/bakLlava-v1-hf | m5b | 
| 🤗/llava-hf/llava-1.5-7b-hf | m5b | 
| 🤗/llava-hf/llava-1.5-13b-hf | m5b | 
| 🤗/OpenGVLab/InternVL-Chat-Chinese-V1-1 | m5b | 
| 🤗/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus | m5b | 
| 🤗/liuhaotian/llava-v1.6-vicuna-7b | m5b-llava | 
| 🤗/liuhaotian/llava-v1.6-vicuna-13b | m5b-llava | 
| 🤗/liuhaotian/llava-v1.6-34b | m5b-llava | 
| 🤗/THUDM/cogvlm-chat-hf | m5b-cogvlm | 
| 🤗/01-ai/Yi-VL-6B | m5b-yivl | 
| 🤗/01-ai/Yi-VL-34B | m5b-yivl | 
| 🤗/openbmb/OmniLMM-12B | m5b-omnilmm | 
| 🤗/Qwen/Qwen-VL-Chat | m5b-qwenvl | 
| gpt-4-turbo-2024-04-09 | m5b | 
| gpt-4-1106-vision-preview | m5b | 
| gpt-4-turbo | m5b | 
| gpt-4-vision-preview | m5b | 
| gemini-pro-vision | m5b | 
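For example, to evaluate one of the original LLaVA v1.6 models you would first switch to its dedicated environment (a minimal sketch, assuming `mamba` is initialized in your shell):

```bash
# The liuhaotian/llava-v1.6-* models run in the m5b-llava environment (see the table above)
mamba activate m5b-llava
```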
The M5 Benchmark consists of the following datasets: `marvl`, `xgqa`, `xm3600`, `xvnli`, `maxm`, `xflickrco`, `m5b_vgr`, and `m5b_vlod`.
| Dataset ID | 🤗 | Name | 
|---|---|---|
| marvl | 🤗/floschne/marvl | MaRVL: Multicultural Reasoning over Vision and Language | 
| xgqa | 🤗/floschne/xgqa | xGQA: Cross-Lingual Visual Question Answering | 
| xm3600 | 🤗/floschne/xm3600 | Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset | 
| xvnli | 🤗/floschne/xvnli | Zero-Shot Cross-Lingual Visual Natural Language Inference | 
| maxm | 🤗/floschne/maxm | MaXM: Towards Multilingual Visual Question Answering | 
| xflickrco | 🤗/floschne/xflickrco | xFlickrCOCO | 
| m5b_vgr | 🤗/floschne/m5b_vgr | M5B Visually Grounded Reasoning | 
| m5b_vlod | 🤗/floschne/m5b_vlod | M5B Visual Outlier Detection | 
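The combined tar archive above already contains all of these datasets as WebDataset shards. If you only want to inspect a single dataset, you can also pull it from the Hugging Face Hub; a minimal sketch, assuming the `huggingface-cli` tool from `huggingface_hub` is installed (the local target directory is arbitrary, and the Hub copies are not guaranteed to match the WebDataset layout expected by the eval script):

```bash
# Download the xGQA dataset repository from the Hugging Face Hub (repo ID from the table above)
huggingface-cli download floschne/xgqa --repo-type dataset --local-dir ./hf_datasets/xgqa
```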
First, activate the correct environment for the model you want to evaluate as described above. Then, in the project root, run the following command:
```bash
PYTHONPATH=${PWD}/src python src/m5b/scripts/eval.py \
    --model_id="<model_id>" \
    --dataset="<dataset_id>" \
    --data_base_path=/path/to/the/m5b/datasets
```

where `<model_id>` is a model ID from the models table above and `<dataset_id>` is a dataset ID from the datasets table above.
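For example, to evaluate LLaVA-1.5-7B on xGQA, a concrete invocation might look as follows (a sketch with IDs taken from the tables above; it assumes the 🤗/ prefix in the models table is a Hub link marker rather than part of the literal `--model_id` value, and that the `m5b` environment is active):

```bash
# Evaluate llava-1.5-7b-hf (environment: m5b) on the xGQA dataset
PYTHONPATH=${PWD}/src python src/m5b/scripts/eval.py \
    --model_id="llava-hf/llava-1.5-7b-hf" \
    --dataset="xgqa" \
    --data_base_path=/path/to/the/m5b/datasets
```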
If you use this code or the M5 Benchmark in your research, please cite the following paper:
```bibtex
@inproceedings{schneider2024m5benchmark,
    title     = {M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks},
    author    = {Schneider, Florian and Sitaram, Sunayana},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
    address   = {Miami, Florida, USA},
    publisher = {Association for Computational Linguistics},
    url       = {https://aclanthology.org/2024.findings-emnlp.250},
    pages     = {4309--4345},
    year      = {2024}
}
```