# Gemamba

This repository contains training code for the Gemamba multimodal language model.

Gemamba is the first multimodal LLM to combine a Mamba-based video encoder with the performant and flexible Gemma transformer LLM in a LLaVA-style architecture.
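The forward path follows the usual LLaVA recipe: encode video frames with the Mamba-based encoder, project the resulting tokens into the LLM's embedding space, and feed them to the LLM together with the text tokens. The sketch below only illustrates that wiring; the class names, dimensions, and the two-layer MLP projector are assumptions for illustration, not the repository's exact modules.

```python
# Illustrative LLaVA-style wiring (hypothetical classes and dimensions, not the repo's API):
# vision tokens -> projector (MLP) -> prepended to text embeddings -> LLM.
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Two-layer MLP mapping vision features to the LLM hidden size (placeholder dims)."""
    def __init__(self, vision_dim=576, llm_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens):              # (B, N_vis, vision_dim)
        return self.net(vision_tokens)              # (B, N_vis, llm_dim)

def build_multimodal_inputs(video_features, text_embeds, projector):
    """Prepend projected video tokens to the text token embeddings fed to the LLM."""
    visual_embeds = projector(video_features)                 # (B, N_vis, llm_dim)
    return torch.cat([visual_embeds, text_embeds], dim=1)     # (B, N_vis + N_txt, llm_dim)
```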

## Getting started

We recommend using Dev Containers to create the environment from the pre-made configuration.

1. Install PyTorch.

2. Install Python dependencies:

       pip3 install -r requirements.txt

   Install VideoMamba dependencies:

       pip3 install -e llava/model/multimodal_encoder/videomamba/causal-conv1d
       pip3 install -e llava/model/multimodal_encoder/videomamba/mamba

   [optional] Update transformers to get Phi3 support:

       pip3 install git+https://github.com/huggingface/transformers

3. Download the pretrained weights for VideoMamba:

       wget https://huggingface.co/OpenGVLab/VideoMamba/resolve/main/videomamba_m16_25M_f8_res224.pth
4. Refer to run_finetune.ipynb to learn how to load a checkpoint and run inference; a minimal sketch is also shown below.
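For orientation, here is a hypothetical loading-and-generation sketch. The model class, checkpoint path, and prompt handling are placeholders (video preprocessing is omitted entirely); run_finetune.ipynb is the authoritative reference.

```python
# Illustrative only: the exact loader lives in run_finetune.ipynb.
# The import and paths below are placeholders, not the repository's actual API.
import torch
from transformers import AutoTokenizer

from llava.model import GemambaForCausalLM  # hypothetical class name

checkpoint = "path/to/gemamba-checkpoint"   # placeholder path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = GemambaForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16)
model.eval().cuda()

# Video frames would normally be preprocessed and passed alongside the prompt;
# that repo-specific step is not shown here.
prompt = "Describe what happens in the video."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```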

## Pretrained checkpoints

A pretrained checkpoint for the model can be found here: HF 🤗.

- The model's projector has been pretrained for 1 epoch on the Valley dataset.
- The LLM and the projector have then been jointly fine-tuned on the Video-ChatGPT dataset; see the sketch of this two-stage recipe after this list.
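The two stages above follow the standard LLaVA recipe: first train only the projector, then unfreeze the LLM and fine-tune it jointly with the projector (the video encoder typically stays frozen throughout). The sketch below illustrates this; `model.projector` and `model.llm` are hypothetical attribute names and the optimizer settings are illustrative.

```python
# Hedged sketch of the two-stage recipe described above (attribute names are placeholders).
import torch

def configure_stage(model, stage: str):
    """Stage 1: train only the projector. Stage 2: train projector + LLM jointly."""
    for p in model.parameters():               # freeze everything, incl. the video encoder
        p.requires_grad = False
    if stage == "pretrain_projector":           # e.g. 1 epoch on Valley
        for p in model.projector.parameters():
            p.requires_grad = True
    elif stage == "finetune_joint":             # e.g. Video-ChatGPT instruction data
        for p in model.projector.parameters():
            p.requires_grad = True
        for p in model.llm.parameters():
            p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=2e-5)  # illustrative learning rate
```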

## Training

We inherit most of the training workflow from the original LLaVA. Refer to scripts/train for the configurations used to train the model, and to scripts/eval for the scripts used to compute benchmark scores.
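For a rough sense of what such a configuration contains, the snippet below shows the kind of hyperparameters a LLaVA-style fine-tuning run typically sets via `transformers.TrainingArguments`. All values and paths are illustrative assumptions; the authoritative settings are the ones in scripts/train.

```python
# Hypothetical illustration of the kind of settings found in scripts/train;
# actual values may differ.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints/gemamba-finetune",   # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=2e-5,
    num_train_epochs=1,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)
```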