- [2025/04/13] We have updated the inference script so that it no longer relies on VLMEvalKit, based on the suggestion in this issue.
- [2025/01/08] We released the full training code.
- [2025/01/02] We discovered that when testing on the AI2D benchmark, we were using AI2D_TEST_NO_MASK, whereas VLMEvalKit uses AI2D_TEST. We previously overlooked the distinction between the two, and we sincerely apologize for this oversight. We will make the necessary corrections.
- [2024/11/28] We've released the dataset: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
- [2024/11/25] We've released the code for dataset generation: dataset_generation/generate.py
- [2024/11/23] We've released the Gradio App: https://huggingface.co/spaces/Xkev/Llama-3.2V-11B-cot
- [2024/11/21] LLaVA-o1 has been renamed to LLaVA-CoT: https://arxiv.org/abs/2411.10440v2
- [2024/11/20] We've released the pretrained weights: https://huggingface.co/Xkev/Llama-3.2V-11B-cot
- [2024/11/18] We've released our paper: https://arxiv.org/abs/2411.10440
- [2024/11/18] Feel free to watch 👀 this repository for the latest updates.
LLaVA-CoT is a visual language model capable of spontaneous, systematic reasoning.
Our 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six challenging multimodal benchmarks.
LLaVA-CoT begins by outlining the problem, interprets relevant information from the image, proceeds step-by-step through reasoning, and ultimately reaches a well-supported conclusion.
You can download the pretrained weights from Hugging Face: Xkev/Llama-3.2V-11B-cot.
You can download the dataset from Hugging Face: Xkev/LLaVA-CoT-100k.
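If you prefer to fetch both locally ahead of time, a minimal sketch using `huggingface_hub` (one of several ways; `transformers` can also download the model on demand):

```python
# Optional: pre-download the weights and dataset with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="Xkev/Llama-3.2V-11B-cot")                      # model weights
snapshot_download(repo_id="Xkev/LLaVA-CoT-100k", repo_type="dataset")     # training data
```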
You can use the same code as Llama-3.2-11B-Vision-Instruct to load the model and perform inference.
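For reference, a minimal inference sketch with Hugging Face `transformers`, following the standard Llama-3.2-11B-Vision-Instruct usage; the image path, prompt, and generation settings below are illustrative placeholders, not the authors' recommended configuration.

```python
import torch
from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "Xkev/Llama-3.2V-11B-cot"
model = MllamaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # replace with your own image
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this image?"},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Generate the full structured response (summary, caption, reasoning, conclusion).
output = model.generate(**inputs, max_new_tokens=2048)
print(processor.decode(output[0], skip_special_tokens=True))
```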
If you want to perform inference-time scaling, you can refer to the detailed instructions provided in this file.
You may use any repository that supports Llama-3.2-11B-Vision-Instruct for finetuning.
We recommend using llama-recipes.
To reproduce our results, you can use the following commands:
```bash
cd train
pip install llama-recipes
torchrun --nnodes 1 --nproc_per_node 8 --master_port 29500 finetuning.py --enable_fsdp --lr 1e-5 --num_epochs 3 --batch_size_training 4 --model_name meta-llama/Llama-3.2-11B-Vision-Instruct --dist_checkpoint_root_folder ./finetuned_model --dist_checkpoint_folder LLaVA-CoT --use_fast_kernels --dataset "custom_dataset" --custom_dataset.test_split "test" --custom_dataset.file "datasets/cot_dataset.py" --run_validation False --batching_strategy padding
```
Remember to modify `data_path` and `image_base_path` in `train/cot_dataset.py` to your own paths (the paths to the training dataset and its images), as sketched below.
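A hypothetical illustration of that edit; the actual variable layout and file names in `train/cot_dataset.py` and in the downloaded dataset may differ from this sketch.

```python
# Hypothetical example values; point these at your local copy of LLaVA-CoT-100k.
data_path = "/your/path/to/LLaVA-CoT-100k/train.jsonl"   # training annotation file
image_base_path = "/your/path/to/LLaVA-CoT-100k"          # root directory containing the images
```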
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
```bibtex
@misc{xu2024llavacot,
      title={LLaVA-CoT: Let Vision Language Models Reason Step-by-Step},
      author={Guowei Xu and Peng Jin and Hao Li and Yibing Song and Lichao Sun and Li Yuan},
      year={2024},
      eprint={2411.10440},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10440},
}
```
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to the LLAMA 3.2 COMMUNITY LICENSE AGREEMENT and the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations.
- The template is modified from Chat-UniVi and LLaVA.