This repository contains training code for the Gemamba multimodal language model.
Gemamba is the first multimodal LLM to combine a Mamba-based video encoder with the performant and flexible Gemma transformer LLM in a LLaVA-style architecture.
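At a high level, the video encoder's features are projected into the LLM's embedding space and consumed as ordinary tokens. The sketch below illustrates this LLaVA-style wiring only; the class name, attribute names, two-layer MLP projector, and dimensions are illustrative assumptions, not the repository's actual modules (see `llava/model` for the real implementation).

```python
# LLaVA-style wiring sketch. GemambaSketch, the attribute names, the MLP
# projector, and the default dimensions are illustrative assumptions; see
# llava/model in this repository for the real implementation.
import torch
import torch.nn as nn

class GemambaSketch(nn.Module):
    def __init__(self, video_encoder, llm, vision_dim=576, llm_dim=2048):
        super().__init__()
        self.video_encoder = video_encoder  # Mamba-based video encoder (VideoMamba)
        # Projector mapping visual features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # Gemma decoder-only transformer

    def forward(self, video_frames, text_embeds):
        # video_frames: (batch, frames, channels, height, width)
        visual_feats = self.video_encoder(video_frames)  # (batch, tokens, vision_dim)
        visual_tokens = self.projector(visual_feats)     # (batch, tokens, llm_dim)
        # Visual tokens are spliced into the text embedding sequence
        # (prepended here for brevity) and processed by the LLM as ordinary tokens.
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```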
We recommend using Dev Containers to create the environment from the pre-made configuration.
- Install PyTorch.
- Install Python dependencies:
  ```bash
  pip3 install -r requirements.txt
  ```
- Install VideoMamba dependencies:
  ```bash
  pip3 install -e llava/model/multimodal_encoder/videomamba/causal-conv1d
  pip3 install -e llava/model/multimodal_encoder/videomamba/mamba
  ```
- [optional] Update transformers to get Phi3 support:
  ```bash
  pip3 install git+https://github.com/huggingface/transformers
  ```
- Download pretrained weights for VideoMamba:
  ```bash
  wget https://huggingface.co/OpenGVLab/VideoMamba/resolve/main/videomamba_m16_25M_f8_res224.pth
  ```
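If you want to verify the download, a minimal sanity check such as the one below can be used. The assumed checkpoint layout (a raw state dict, or a dict wrapping one under a `model` key) is a guess, not something specified by the repository.

```python
# Optional sanity check of the downloaded encoder weights. The layout of the
# checkpoint (raw state dict vs. a dict with a "model" key) is an assumption.
import torch

ckpt = torch.load("videomamba_m16_25M_f8_res224.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
print(f"loaded {len(state_dict)} tensors, e.g. {next(iter(state_dict))}")
```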
- Refer to `run_finetune.ipynb` to learn how to load a checkpoint and run inference; a rough outline of the flow is also sketched below.
A pretrained checkpoint for the model can be found here: HF 🤗.
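For orientation only, the outline below shows the general shape of such an inference call. The `videos=` keyword, the frame-tensor shape, and the generation settings are hypothetical placeholders; `run_finetune.ipynb` shows the actual API.

```python
# Rough outline of checkpoint loading and inference. The `videos=` keyword and
# the preprocessing shape are hypothetical; run_finetune.ipynb shows the real calls.
import torch

@torch.no_grad()
def answer(model, tokenizer, video_frames, prompt, device="cuda"):
    # video_frames: preprocessed frame tensor, e.g. (1, num_frames, 3, H, W)
    model = model.to(device).eval()
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(
        **inputs,
        videos=video_frames.to(device),  # hypothetical keyword for the visual input
        max_new_tokens=256,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```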
- The model's projector has been pretrained for 1 epoch on the Valley dataset.
- The LLM and the projector have been jointly fine-tuned using the Video-ChatGPT dataset.
We inherit most of the training workflow from the original LLaVA. Please refer to `scripts/train` to see the configurations used for training the model. See `scripts/eval` for the scripts used to calculate benchmark scores.
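As a rough illustration of the two stages above, the sketch below keeps only the projector trainable in stage 1 and unfreezes the LLM for the joint stage 2 fine-tuning. The attribute prefixes and the assumption that the video encoder stays frozen (as in standard LLaVA training) are ours; the actual freezing logic and hyperparameters are defined by the configurations in `scripts/train`.

```python
# Hypothetical illustration of the two-stage schedule; the real freezing logic
# and hyperparameters are defined by the configurations in scripts/train.
def configure_stage(model, stage):
    if stage == "pretrain_projector":   # stage 1: Valley, projector only
        trainable_prefixes = ("projector",)
    elif stage == "finetune":           # stage 2: Video-ChatGPT, projector + LLM
        trainable_prefixes = ("projector", "llm")
    else:
        raise ValueError(f"unknown stage: {stage}")
    # The video encoder is assumed to stay frozen in both stages, as in LLaVA.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
```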