This project aims to optimize the LLaMA model for visual information understanding, in the spirit of GPT-4, and to further explore the potential of large language models.
In general, we use a CLIP vision encoder to extract image features, which are then projected into the text embedding space by an MLP-based or Transformer-based connection network. The visual representation (including the additional special tokens [boi] and [eoi]) is concatenated with the text representation and trained in an autoregressive manner. The framework is similar to Kosmos-1 and PaLM-E.
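The pipeline can be summarized with a short sketch. This is a minimal, illustrative version rather than the repository's actual code: the connector shown here is a two-layer MLP, the checkpoint paths are placeholders, `MLPConnector` and `build_inputs` are hypothetical names, and the full CLIP patch sequence is kept (in the experiments below the image sequence length is reduced to 10).

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, LlamaForCausalLM, LlamaTokenizer

class MLPConnector(nn.Module):
    """Project CLIP patch features into the LLaMA text-embedding dimension."""
    def __init__(self, clip_dim: int, llama_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, llama_dim),
            nn.GELU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)                                    # (B, L_img, llama_dim)

# Placeholder checkpoint paths -- substitute the models downloaded from Hugging Face.
tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")
tokenizer.add_special_tokens({"additional_special_tokens": ["[boi]", "[eoi]"]})

llama = LlamaForCausalLM.from_pretrained("path/to/llama")
llama.resize_token_embeddings(len(tokenizer))
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
connector = MLPConnector(clip.config.hidden_size, llama.config.hidden_size)

def build_inputs(pixel_values: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Concatenate [boi] + projected image tokens + [eoi] with the text embeddings."""
    img_feats = clip(pixel_values=pixel_values).last_hidden_state  # (B, L_img, clip_dim)
    img_embeds = connector(img_feats)                              # (B, L_img, llama_dim)
    embed = llama.get_input_embeddings()
    boi_id, eoi_id = tokenizer.convert_tokens_to_ids(["[boi]", "[eoi]"])
    batch = pixel_values.size(0)
    boi = embed(torch.full((batch, 1), boi_id))
    eoi = embed(torch.full((batch, 1), eoi_id))
    text_embeds = embed(input_ids)
    # The fused sequence is fed to LLaMA as inputs_embeds and trained autoregressively.
    return torch.cat([boi, img_embeds, eoi, text_embeds], dim=1)
```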
- Code adjustments to support multi-modal generation. Download the CLIP and LLaMA models from Hugging Face. We also verify that the scripts are compatible with other LLaMA model sizes. Please use the script `preprocess.py` to preprocess the data.
- Supervised training stage: freeze the LLaMA and CLIP encoder models and optimize only the connection network (a freezing sketch is given after this list). In this stage, we use the COCO, CC-3M and COYO-700M datasets with the training script `train.py`. We provide the training hyper-parameters used in our experiments on A100 (80G) GPUs. We also evaluate the image captioning performance on the COCO test set.

  | Argument | Values |
  | --- | --- |
  | batch size | 1 * 8 * 8 |
  | epochs | 3 |
  | cut length | 256 |
  | learning rate | 4e-3 |
  | image sequence length | 10 |
- Instruction tuning stage: fine-tune the full model on a mixture of VQA and language-only instruction datasets. We use the LoRA strategy to optimize the entire model with the fine-tuning script `finetune.py` (a LoRA sketch is given after this list).

  | Argument | Values |
  | --- | --- |
  | batch size | 1024 |
  | epochs | 3 |
  | cut length | 256 |
  | learning rate | 2e-5 |
  | image sequence length | 10 |
- Open-source the trained checkpoints on Hugging Face and a Gradio interface for multi-modal generation.
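For the supervised stage, a minimal sketch of the frozen-backbone setup, reusing the `llama`, `clip`, `connector`, and `build_inputs` names from the sketch above. The actual logic lives in `train.py`; the loss-masking details here are assumptions.

```python
import torch

# Freeze LLaMA and the CLIP encoder; only the connection network receives updates.
for param in llama.parameters():
    param.requires_grad_(False)
for param in clip.parameters():
    param.requires_grad_(False)

# Learning rate taken from the supervised-stage table above.
optimizer = torch.optim.AdamW(connector.parameters(), lr=4e-3)

def training_step(pixel_values, input_ids, labels):
    inputs_embeds = build_inputs(pixel_values, input_ids)
    # Ignore the [boi] + image + [eoi] prefix in the LM loss (-100 is ignored).
    prefix_len = inputs_embeds.size(1) - input_ids.size(1)
    pad = torch.full((labels.size(0), prefix_len), -100, dtype=labels.dtype)
    loss = llama(inputs_embeds=inputs_embeds, labels=torch.cat([pad, labels], dim=1)).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```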
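For the instruction tuning stage, a sketch of applying LoRA with the `peft` library. The rank, alpha, and target modules shown here are assumptions, and whether the vision encoder also receives adapters is not shown; the concrete configuration is in `finetune.py`.

```python
import torch
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters to LLaMA instead of updating the full weights.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections in LLaMA
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
llama = get_peft_model(llama, lora_config)
llama.print_trainable_parameters()

# Optimize the LoRA adapters together with the connection network,
# using the instruction-tuning learning rate from the table above.
trainable = [p for p in list(llama.parameters()) + list(connector.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```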
