
Release v1.0.1

@haotian-liu released this 30 Jul 01:51
  • Added LLaMA-2 support
  • Full LoRA support. To make model training more accessible, we release a set of model weights based on LoRA, which supports training on academic resources (e.g., 4x A6000s or 8x 3090s) without the need for CPU offloading (a configuration sketch follows this list)
  • A more versatile design for training large multimodal models, which supports swapping in different language models and vision encoders, with more options coming soon
  • Support higher-resolution input using CLIP-ViT-L-336px as the vision encoder, for more detailed visual understanding
  • Ablated and cleaned up some design choices to make training simpler and smoother
  • Full DeepSpeed support
  • Improved model checkpoint saving during the pretraining stage to reduce disk usage
  • Improved WebUI interface
  • Improved support for inference with multiple GPUs
  • Support inference with 4-bit and 8-bit quantization (see the quantized-loading sketch after this list)
  • Support interactive CLI inference
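
For the LoRA item above, the following is a minimal sketch of how LoRA-based finetuning of a causal language model typically looks with the Hugging Face PEFT library. The base model name, rank, and target modules are illustrative assumptions, not the exact configuration used by this release's training scripts.

```python
# Minimal LoRA finetuning sketch with Hugging Face PEFT (assumed settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA injects small low-rank adapters into the attention projections;
# only these adapters are trained, which is what keeps memory low enough
# for setups like 4x A6000s or 8x 3090s.
lora_config = LoraConfig(
    r=128,                      # rank of the low-rank update (assumed value)
    lora_alpha=256,             # scaling factor (assumed value)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```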
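
For the 4-bit/8-bit quantization item, here is a rough sketch of quantized loading and generation with transformers and bitsandbytes; the checkpoint name, prompt, and generation settings are placeholders rather than the release's actual CLI code.

```python
# Minimal 4-bit quantized inference sketch (illustrative settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; substitute the released checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # use load_in_8bit=True for 8-bit instead
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",                    # also spreads layers across multiple GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("Describe the image.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```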

We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.

We hope this release further benefits the community and makes large multimodal models more accessible.

Detailed Changes

  • Tokenization. We remove the dependency on the additional tokens (<IM_START>, <IM_END>, <IM_PATCH>), so that the tokenizer does not change at all during the pretraining stage and we only update the linear projector weights (a minimal freezing sketch follows this list).
  • Prompt.
    • Pretraining. We simplified the pretraining prompts by removing additional instructions like "Describe the image details", which we find still allows zero-shot inference and can slightly improve training speed.
    • We keep the train/test prompts consistent, which we find slightly improves the model's performance during inference.
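
As referenced in the Tokenization item, the sketch below illustrates the general idea of updating only a linear projector while everything else stays frozen. The module names (vision_tower, mm_projector, language_model) and dimensions are hypothetical stand-ins, not the actual LLaVA implementation.

```python
# Toy sketch of projector-only pretraining with hypothetical module names.
import torch
import torch.nn as nn

class ToyStage1Model(nn.Module):
    def __init__(self, vision_dim=1024, hidden_dim=4096):
        super().__init__()
        self.vision_tower = nn.Linear(3, vision_dim)          # stands in for a frozen vision encoder
        self.mm_projector = nn.Linear(vision_dim, hidden_dim) # the only trainable piece
        self.language_model = nn.Linear(hidden_dim, hidden_dim)  # stands in for a frozen LLM

model = ToyStage1Model()

# Freeze every parameter, then unfreeze only the projector: the tokenizer
# and both backbones stay untouched during pretraining.
for p in model.parameters():
    p.requires_grad = False
for p in model.mm_projector.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```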