VLM-R1: A stable and generalizable R1-style Large Vision-Language Model

🎉 Our VLM-R1 Math model reaches the top of the OpenCompass Math Leaderboard (under 4B parameters).

Since the introduction of DeepSeek-R1, numerous works have emerged that focus on reproducing and improving upon it. In this project, we propose VLM-R1, a stable and generalizable R1-style Large Vision-Language Model.

Specifically, for the task of Referring Expression Comprehension (REC), we trained Qwen2.5-VL using both the R1 and SFT approaches. The results reveal that, on in-domain test data, the performance of the SFT model is slightly lower than that of the R1 model (as shown on the left of the figure below). However, on out-of-domain test data, the SFT model's performance deteriorates significantly as the number of training steps increases, while the R1 model shows steady improvement (as shown on the right of the figure below).

[Figure: REC performance of the SFT and R1 models over training steps, on in-domain (left) and out-of-domain (right) test data.]

🚀 Features

This repository supports GRPO and SFT training of large vision-language models, including LoRA fine-tuning, multi-node training, multi-image input, and multiple base VLMs (see the updates below for details).

πŸ—žοΈ Update

  • 2025-03-17: Our VLM-R1 Math model reaches the top of the OpenCompass Math Leaderboard (under 4B parameters). We have released the checkpoint.
  • 2025-03-15: We support multi-image input data. Check the format of multi-image input here. We also provide an example multi-image script, run_grpo_gui.sh; see here for details.
  • 2025-03-13: We support InternVL for GRPO. See run_grpo_rec_internvl.sh for details. The annotation JSON files used for InternVL are here. If you want to add a new model, please refer to How to add a new model.
  • 2025-03-02: We support LoRA Fine-tuning for GRPO. See run_grpo_rec_lora.sh for details.
  • 2025-02-27: We add support for configuring the number of optimization iterations per batch and the epsilon value used for clipping in the original GRPO algorithm via the args --num_iterations and --epsilon (see the schematic objective after this list).
  • 2025-02-25: We support multi-node training for GRPO. See multinode_training_demo.sh for details.
  • 2025-02-21: We release the checkpoint of the VLM-R1 REC model.
  • 2025-02-20: We release the script for general data loading.
  • 2025-02-19: We incorporate an explanation of the SFT method.
  • 2025-02-17: We release the VLM-R1 REC Demo on Hugging Face Spaces.
  • 2025-02-15: We release the VLM-R1 repository and GRPO training script.
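
For reference, a schematic form of the clipped GRPO objective that --epsilon controls is given below (group-normalized advantages, the KL penalty, and per-token details are omitted; see the GRPO/DeepSeek-R1 papers for the exact formulation):

$$
\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(r_i(\theta)\,\hat{A}_i,\; \mathrm{clip}\big(r_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big)\right],
\qquad r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_\mathrm{old}}(o_i \mid q)}
$$

Here --epsilon sets the clipping range ε, and --num_iterations controls how many optimization iterations are performed on each sampled batch.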

🤖 Models

  • Math: Through VLM-R1 training, our math model focuses on multimodal reasoning tasks and has achieved Top-1 on the OpenCompass Multi-modal Reasoning Leaderboard among models under 4B parameters.
  • REC: Trained with VLM-R1, our Referring Expression Comprehension (REC) model demonstrates superior performance on out-of-domain data and a series of reasoning-grounding tasks.
Version                           | Base VLM     | Checkpoint                              | Task Type
VLM-R1-Qwen2.5VL-3B-Math-0305     | Qwen2.5VL-3B | omlab/VLM-R1-Qwen2.5VL-3B-Math-0305     | Multi-Modal Math
VLM-R1-Qwen2.5VL-3B-REC-500steps  | Qwen2.5VL-3B | omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps  | REC/Reasoning-Grounding
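
To try a released checkpoint, a minimal inference sketch is shown below. It assumes the checkpoints follow the standard Qwen2.5-VL format on Hugging Face and that a transformers version with Qwen2.5-VL support (>= 4.49) is installed; the image file and prompt are placeholders.

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "omlab/VLM-R1-Qwen2.5VL-3B-Math-0305"
# device_map="auto" requires the accelerate package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# "problem.png" is a placeholder image containing a math problem.
image = Image.open("problem.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Solve the problem in the image. Think step by step."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens.
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])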

🎯 ToDo

  • Implement multi-node training.
  • Implement LoRA Fine-tuning.
  • Support more Multimodal LLMs.
  • Support multi-image input.
  • Release the VLM-R1 Math model.
  • Release the blog of VLM-R1.
  • Study cross-task generalization.
  • Enhance VLMs for other tasks [issues welcome].

πŸ› οΈ Setup

conda create -n vlm-r1 python=3.10
conda activate vlm-r1
bash setup.sh

💪🏻 Training

Referring Expression Comprehension (REC)

📚 GRPO

  1. Download the COCO Train2014 images and unzip them; we refer to the image directory as <your_image_root>.

  2. Download the RefCOCO/+/g and RefGTA annotation files and unzip them (RefGTA is used for out-of-domain evaluation).

  3. Write the paths of the annotation files into the src/open-r1-multimodal/data_config/rec.yaml file (a sanity-check snippet for this config is shown after these steps).

datasets:
    - json_path: /path/to/refcoco_train.json
    - json_path: /path/to/refcocop_train.json
    - json_path: /path/to/refcocog_train.json
  4. bash src/open-r1-multimodal/run_scripts/run_grpo_rec.sh
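
Before launching the script in step 4, you can optionally sanity-check that every json_path listed in rec.yaml points to an existing file, for example with a short snippet like the following (a hypothetical helper, not part of the repo):

import os
import yaml  # pip install pyyaml

# Adjust the path if you run this from a different directory.
with open("src/open-r1-multimodal/data_config/rec.yaml") as f:
    config = yaml.safe_load(f)

for entry in config["datasets"]:
    path = entry["json_path"]
    print(("ok     " if os.path.isfile(path) else "MISSING"), path)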

Note

If you encounter a 'CUDA out of memory' error, you can try to (1) set gradient_checkpointing to true, (2) reduce per_device_train_batch_size, or (3) use LoRA fine-tuning (see run_grpo_rec_lora.sh).

cd src/open-r1-multimodal

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12346" \
    src/open_r1/grpo_rec.py \
    --deepspeed local_scripts/zero3.json \
    --output_dir output/$RUN_NAME \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --dataset_name data_config/rec.yaml \
    --image_root <your_image_root> \
    --max_prompt_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 2 \
    --run_name $RUN_NAME \
    --save_steps 100 \
    --save_only_model true \
    --freeze_vision_modules false # If you want to only finetune the language model, set this to true.

📚 Multi-Node GRPO

For multi-node training, please refer to multinode_training_demo.sh.

📚 SFT

We use LLaMA-Factory to train the SFT model.

  1. Clone the LLaMA-Factory repository and install the dependencies.
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
  2. Download the dataset_info.json, mllm_rec_json.json, and qwen2_5_vl_full_sft.yaml we provided here. Put the JSON files in the LLaMA-Factory/data directory and the YAML file in the LLaMA-Factory/examples/train_full directory.

  3. Run the following command to train the SFT model.

llamafactory-cli train examples/train_full/qwen2_5_vl_full_sft.yaml

For your own data

We also support loading JSONL data in the format below via src/open-r1-multimodal/src/open_r1/grpo_jsonl.py. Please note that you may need to use different reward functions for your specialized tasks. PRs that add your own reward functions or share other interesting findings are welcome!
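
As an illustration of the kind of reward function you might add, here is a sketch of an exact-match accuracy reward on an R1-style <answer> tag. The actual signature and registration mechanism expected by grpo_jsonl.py may differ; treat this only as a sketch of the idea.

import re

def accuracy_reward(completion: str, solution: str) -> float:
    """Return 1.0 if the text inside <answer>...</answer> matches the ground truth, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    predicted = (match.group(1) if match else completion).strip()
    return 1.0 if predicted == solution.strip() else 0.0

# Example: an R1-style completion with a reasoning trace and a final answer.
print(accuracy_reward("<think>Count the purple metallic balls...</think><answer>0</answer>", "0"))  # 1.0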

The JSONL file has the following format:

{
  "id": 1,
  "image": "Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16885.png",
  "conversations": [
    {"from": "human", "value": "<image>What number of purple metallic balls are there?"},
    {"from": "gpt", "value": "0"}
  ]
}

If you want to use multi-image input, you can use the following format:

{
  "id": 1,
  "image": ["Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16885.png", "Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16886.png"],
  "conversations": [
    {"from": "human", "value": "<image><image>What number of purple metallic balls in total in the two images?"},
    {"from": "gpt", "value": "3"}
  ]
}
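
A short sketch for writing your own records in this format (the file name, image paths, and contents below are hypothetical); each record goes on its own line, and the number of <image> placeholders should match the number of images:

import json

records = [
    {
        "id": 1,
        "image": ["scene_a.png", "scene_b.png"],  # paths relative to an --image_folders entry
        "conversations": [
            {"from": "human", "value": "<image><image>What number of purple metallic balls in total in the two images?"},
            {"from": "gpt", "value": "3"},
        ],
    },
]

# Write one JSON object per line.
with open("my_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")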

Note: The image path in the JSONL file should be relative to the image folder specified in --image_folders. The absolute path of the input image is constructed as os.path.join(image_folder, data['image']). For example:

  • If your jsonl has "image": "folder1/image1.jpg"
  • And you specify --image_folders "/path/to/images/"
  • The full image path will be /path/to/images/folder1/image1.jpg

Multiple data files and image folders can be specified using ":" as a separator:

--data_file_paths /path/to/data1.jsonl:/path/to/data2.jsonl \
--image_folders /path/to/images1/:/path/to/images2/
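
A small illustration (paths hypothetical) of the mechanics described above: entries are split on ":" and each image path from the JSONL is joined onto an image folder with os.path.join:

import os

image_folders = "/path/to/images1/:/path/to/images2/".split(":")  # ['/path/to/images1/', '/path/to/images2/']
record = {"image": "folder1/image1.jpg"}                          # as stored in the JSONL file
print(os.path.join(image_folders[0], record["image"]))            # /path/to/images1/folder1/image1.jpg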

The script can be run like this:

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo_jsonl.py \
    --output_dir output/$RUN_NAME \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --deepspeed local_scripts/zero3.json \
    --dataset_name <your_dataset_name> \
    --data_file_paths /path/to/your/data.jsonl \
    --image_folders /path/to/your/image/folder/ \
    ...

We provide an example multi-image script, run_grpo_gui.sh. This task, which comes from GUI-Testing-Arena, requires the model to analyze two GUI screenshots, taken before and after a user action, to determine whether any UI interaction defects are present. Download the images and unzip them into /path/to/images/. Then modify the image_folders parameter in the script and run it.

bash src/open-r1-multimodal/run_scripts/run_grpo_gui.sh

📊 Evaluation

[Figure: evaluation results]

  1. Download the provided RefGTA images.

  2. Run the evaluation scripts:
cd ./src/eval

# Remember to change the model path, image root, and annotation path in the script
python test_rec_r1.py # for GRPO
python test_rec_baseline.py # for SFT
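
For reference, REC accuracy is conventionally measured as the fraction of predictions whose IoU with the ground-truth box is at least 0.5. A minimal sketch of that metric is shown below (not the repo's eval code; the box values are made up):

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# (predicted_box, ground_truth_box) pairs from the model outputs and annotations.
pairs = [((10, 10, 50, 50), (12, 11, 48, 52))]
accuracy = sum(iou(pred, gt) >= 0.5 for pred, gt in pairs) / len(pairs)
print(f"Acc@0.5IoU: {accuracy:.3f}")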

🤝 Acknowledgements

We would like to express our sincere gratitude to DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, RefCOCO, and RefGTA for providing open-source resources that contributed to the development of this project.

⭐️ Citation

If you find this project useful, please consider citing us:

@misc{shen2025vlmr1,
  author       = {Shen, Haozhan and Zhang, Zilun and Zhao, Kangjia and Zhang, Qianqian and Xu, Ruochen and Zhao, Tiancheng},
  title        = {VLM-R1: A stable and generalizable R1-style Large Vision-Language Model},
  howpublished = {\url{https://github.com/om-ai-lab/VLM-R1}},
  note         = {Accessed: 2025-02-15},
  year         = {2025}
}