Releases: haotian-liu/LLaVA

Release v1.2.0 (LLaVA-1.6)

31 Jan 06:07

LLaVA-1.6 is out! With additional scaling over LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and handle more tasks and applications than before. Check out the blog post, and explore the demo! Models are available in the Model Zoo. Training/eval data and scripts are coming soon.

Release v1.1.3 (Bring your own data, LoRA training)

26 Oct 20:40

Updates

  • Support LoRA for the instruction tuning stage of LLaVA-1.5 -- comparable performance to full-model finetuning, with reduced GPU VRAM requirements. (ckpts/logs, script)
  • Bring your own data and finetune LLaVA-1.5 to your own task. (instruction)
  • Basic support for Windows. (instruction)
  • Fix: training with gradient accumulation now behaves the same as large-batch training.

Notes

  • A new LoRA schedule is used for LLaVA-1.5 (see the sketch after this list):
    • rank: 128
    • alpha: 256
    • lr (LoRA): 2e-4
    • lr (projector): 2e-5
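
For reference, here is a minimal sketch of how this schedule could be expressed with Hugging Face PEFT. It is illustrative only -- the actual training scripts expose these values as command-line flags, and the names below are not the repo's exact API.

```python
# Illustrative sketch: the LoRA schedule above written as a PEFT LoraConfig.
# Not the repo's exact configuration code; hyperparameters are from the notes above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,            # LoRA rank
    lora_alpha=256,   # LoRA scaling factor (alpha)
    bias="none",
    task_type="CAUSAL_LM",
)

# The two learning rates are applied to separate optimizer parameter groups:
LORA_LR = 2e-4        # LoRA adapter weights
PROJECTOR_LR = 2e-5   # multimodal projector weights
```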

Release v1.1.1

12 Oct 00:17

In this version, we release the training scripts, data, and evaluation scripts on benchmarks for LLaVA 1.5. Bake your LLaVA today!

LLaVA-1.5 achieves SoTA on 11 benchmarks with only simple modifications to the original LLaVA: it uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that are trained on billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo!

Release v1.1.0

08 Oct 14:01

🔥 LLaVA-1.5 is out! This release supports LLaVA-1.5 model inference and serving.
We will release the training scripts, data, and evaluation scripts on benchmarks in the coming week.

LLaVA-1.5 achieves SoTA on 11 benchmarks with only simple modifications to the original LLaVA: it uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that are trained on billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo, with training and evaluation scripts coming in the next week!

Release v1.0.2

05 Sep 18:30
  • Added model zoo
  • Improved support for ScienceQA with the latest training configurations
  • Improved docs

We are working to continue improving the documentation. Please let us know if you find any documentation unclear, thanks!

Release v1.0.1

30 Jul 01:51
  • Added LLaMA-2 support
  • Full LoRA support. To make model training more accessible, we release a set of LoRA-based model weights; LoRA training runs on academic resources (e.g. 4x A6000s or 8x 3090s) without CPU offloading
  • A more versatile design for training large multimodal models, including swapping in different language models and vision encoders, with more options coming soon
  • Support for higher-resolution input using CLIP-ViT-L-336px as the vision encoder, for more detailed visual understanding
  • Ablated and cleaned up several design choices to make training simpler and smoother
  • Full DeepSpeed support
  • Improved model checkpoint saving during the pretraining stage to save disk space
  • Improved WebUI interface
  • Improved support for inference with multiple GPUs
  • Support for inference with 4-bit and 8-bit quantization (see the sketch after this list)
  • Support interactive CLI inference
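
The 4-bit/8-bit quantization support builds on bitsandbytes. Below is a minimal, illustrative sketch of what 4-bit loading looks like through Hugging Face Transformers; the checkpoint path is a placeholder, and LLaVA's own serving code handles the multimodal components and exposes the equivalent behavior behind its CLI flags.

```python
# Minimal sketch of 4-bit quantized loading with bitsandbytes via Transformers.
# "path/to/llava-checkpoint" is a placeholder, not a real checkpoint name;
# LLaVA's serving code wraps the equivalent logic behind its own options.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # or load_in_8bit=True for 8-bit
    bnb_4bit_compute_dtype=torch.float16,  # weights stay 4-bit, compute in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/llava-checkpoint",
    quantization_config=bnb_config,
    device_map="auto",
)
```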

We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.

We hope this release further benefits the community and makes large multimodal models more accessible.

Detailed Changes

  • Tokenization. We remove the dependency on the additional tokens (<IM_START>, <IM_END>, <IM_PATCH>), so that during the pretraining stage the tokenizer does not change at all and we only update the linear projector weights (see the sketch after this list).
  • Prompt.
    • Pretraining. We simplified the pretraining prompts by removing additional instructions such as "Describe the image details", which we find still allows zero-shot inference and can slightly improve training speed.
    • We keep the train/test prompts consistent, which we find slightly improves the model's performance at inference.
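
To illustrate the pretraining setup described in the Tokenization item (tokenizer untouched, only the linear projector trained), here is a minimal sketch of the parameter freezing. The name mm_projector is an assumption about the parameter naming in the model definition, not a guaranteed API.

```python
# Illustrative sketch of the pretraining stage described above: the language
# model and vision encoder are frozen, and only the linear projector that maps
# vision features into the LLM embedding space receives gradients.
# "mm_projector" is an assumed parameter-name substring.
def freeze_all_but_projector(model, projector_key: str = "mm_projector"):
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = projector_key in name
        if param.requires_grad:
            trainable.append(name)
    return trainable  # sanity check: should contain only projector weights
```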