
Commit 1515745

vila1.5 release
1 parent: eaadb1e (commit 1515745)

File tree

257 files changed (+22378 additions, -2606 deletions)


README.md

Lines changed: 98 additions & 54 deletions
Large diffs are not rendered by default.

data_prepare/README.md

Lines changed: 88 additions & 33 deletions
@@ -1,41 +1,34 @@

# Data Preparation for Training VILA

To train VILA, we used the following datasets:

| Stage                   | Datasets                                                                                                                        |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| 1. Initialize projector | CC3M                                                                                                                              |
| 2. Pre-training         | MMC4-core, COYO-700M, ShareGPT4V_pretrain                                                                                         |
| 3. SFT                  | LLaVA-Next mixture, VFLAN, WIT, GSM8K-ScRel-SFT, Sherlock, ScienceQA, Shot2story, Video_ChatGPT, Youcook2, Vatex, ShareGPT_Video  |

### LLaVA-CC3M-Pretrain

We use [LLaVA-CC3M-Pretrain-595K](https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/blob/main/chat.json) to train the visual language projector.

```bash
mkdir -p ./playground/data/LLaVA-Pretrain
cd ./playground/data/LLaVA-Pretrain

# download chat.json and process
huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K chat.json --repo-type dataset --local-dir . --local-dir-use-symlinks False
mv chat.json LLaVA-CC3M-Pretrain-595K.json

# download images.zip and process
huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K images.zip --repo-type dataset --local-dir . --local-dir-use-symlinks False
unzip images.zip -d images
```

### MMC4-Core Dataset

Due to compute constraints, we pre-train VILA on the smaller core set of MMC4 instead of the full set.

1. First, download the annotations of the MMC4-core dataset from https://github.com/allenai/mmc4. We used the non-fewer-face split, and you may need to request access [here](https://forms.gle/VYtcNY8aYaUANK9f8).

2. Modify the input and output paths in `mmc4_downloader.py`, then run the following script to crawl the MMC4 images:

```bash
cd mmc4
python mmc4_downloader.py
```

Note that because some image URLs have expired, you may end up with only a subset of the entire corpus.

Crawling may take a long time. Optionally, you can also shard the workload over multiple jobs/machines running concurrently to speed up the process:
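
The sharded commands themselves sit in a portion of the file that this diff does not display. Purely as an illustration, and assuming `mmc4_downloader.py` accepts a shard/worker index argument (a hypothetical interface here; check the script for its actual arguments), the idea looks roughly like this:

```bash
# Hypothetical sketch: run several downloader workers in parallel, each on its own slice.
# Replace the argument with whatever sharding interface mmc4_downloader.py actually exposes.
cd mmc4
for WORKER in {0..7}; do
  python mmc4_downloader.py ${WORKER} &
done
wait
```
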
@@ -59,39 +52,43 @@

```bash
python mmc4_merger.py
```

### COYO-700M Dataset

1. Download the metadata of COYO-700M:

```bash
huggingface-cli download kakaobrain/coyo-700m --repo-type dataset --local-dir coyo-700m --local-dir-use-symlinks False
```

2. Crawl the COYO images. Note that we keep only the 20% of samples in each shard with the highest CLIP similarity, to balance compute budget and data quality.

There are 128 shards of annotations in total. Download each one with the script:

```bash
cd coyo
for SHARD in {0..127}; do
  python coyo_downloader.py $SHARD
done
```

3. Split the downloaded COYO data into multiple shards:

```bash
python coyo_splitter.py
```

### LLaVA-1.5 Instruction Data

We use this [file](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json) in our experiments. Please download this dataset from the LLaVA authors.

```bash
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json --repo-type dataset
```
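
Unlike the other commands in this README, the call above does not pass `--local-dir`, so the JSON lands in the Hugging Face cache. If you prefer an explicit location (the directory name below is only an example), the same flags used elsewhere in this document apply:

```bash
# Example only: place llava_v1_5_mix665k.json in a chosen directory instead of the HF cache.
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json \
  --repo-type dataset --local-dir llava_instruct --local-dir-use-symlinks False
```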

### VFlan dataset

#### TextFLAN

1. Download FLAN datasets:

```bash
huggingface-cli download Open-Orca/FLAN --repo-type dataset --local-dir FLAN --local-dir-use-symlinks False
```

@@ -104,7 +101,8 @@

```bash
cd sft
python preprocess_flan.py
```

#### M3IT Dataset

1. Download M3IT datasets:

@@ -123,11 +121,68 @@

```bash
python preprocess_m3it.py
python split_vflan.py
```

### ShareGPT4V

The ShareGPT data can be obtained from [mit-han-lab/ShareGPT4V](https://huggingface.co/datasets/mit-han-lab/ShareGPT4V).

* Note: the original ShareGPT4V dataset contains some samples with file ids (sa_XXXX) and repetitive responses. We filter out those bad examples, which reduces the samples from 100K to 96K (caption) and from 1.2M to 1.17M (pre-training), and then re-combine them into a single file.

```bash
huggingface-cli download mit-han-lab/ShareGPT4V --repo-type dataset --local-dir ShareGPT4V --local-dir-use-symlinks False
```

### LLaVA-Next mixture

You can follow this [page](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets) to prepare the data mixture proposed by LLaVA-Next.

### Shot2story

Please follow this [page](https://github.com/bytedance/Shot2Story/blob/master/DATA.md) to download the videos. The JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset shot2story_shotonly.json --repo-type dataset --local-dir shot2story --local-dir-use-symlinks False
```

### Video_ChatGPT

You can follow this [page](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/README.md#video-instruction-dataset-open_file_folder) to prepare the Video_ChatGPT dataset.

### Youcook2

Please follow this [page](http://youcook2.eecs.umich.edu/) to download the videos. The JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset youcook_filtered_v3.json --repo-type dataset --local-dir youcook2 --local-dir-use-symlinks False
```

### Vatex

Please follow this [page](https://eric-xw.github.io/vatex-website/download.html) to download the videos. The JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset vatex_filtered_v3.json --repo-type dataset --local-dir vatex --local-dir-use-symlinks False
```

### ShareGPT_Video

You can follow this [page](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) to prepare the ShareGPT_Video dataset.
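
The page above is itself a Hugging Face dataset repo; a minimal sketch for mirroring it with the same CLI used throughout this README (which files you actually need depends on the splits listed on the dataset page) would be:

```bash
# Mirror the ShareGPT_Video training data; trim the download to the splits you need,
# as described on the dataset page.
huggingface-cli download ShareGPTVideo/train_video_and_instruction --repo-type dataset --local-dir sharegpt_video --local-dir-use-symlinks False
```
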
### WIT

The original WIT data can be obtained from [google-research-datasets/wit](https://github.com/google-research-datasets/wit/tree/main). We subsample ~538K English samples from the original WIT dataset and curate a JSON file in the LLaVA conversation format, which can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset wit_processed_538k.json --repo-type dataset --local-dir WIT --local-dir-use-symlinks False
```

### GSM8K-ScRel-SFT

We add some math data, [gsm8k-ScRel](https://github.com/OFA-Sys/gsm8k-ScRel/blob/main/data/train_use.jsonl), to our SFT stage.
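
The link above points at a single JSONL file on GitHub. Assuming the usual raw.githubusercontent.com mapping for GitHub blob URLs (worth double-checking in a browser), it can be fetched directly:

```bash
# Fetch the GSM8K-ScRel training file used as math SFT data.
mkdir -p gsm8k-ScRel
wget -O gsm8k-ScRel/train_use.jsonl \
  https://raw.githubusercontent.com/OFA-Sys/gsm8k-ScRel/main/data/train_use.jsonl
```
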
### Sherlock

The image files for Sherlock can be obtained separately from [VisualGenome](https://visualgenome.org/api/v0/api_home.html) and [VCR](https://visualcommonsense.com/download/). The JSON file in LLaVA conversation format can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset sherlock_317k.json --repo-type dataset --local-dir sherlock --local-dir-use-symlinks False
```

### ScienceQA

We use the train split of ScienceQA. The image data of the train split can be obtained from the [ScienceQA huggingface repo](https://huggingface.co/datasets/derek-thomas/ScienceQA). The JSON file in LLaVA conversation format can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset scienceqa_train_12k.json --repo-type dataset --local-dir scienceqa --local-dir-use-symlinks False
```
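
If you also want the image data locally, one option we suggest (check the dataset card for how the images are packaged inside the repo before relying on this) is to mirror the whole dataset repository:

```bash
# Mirror the ScienceQA dataset repo; extract the train-split images from it
# according to the dataset card before training.
huggingface-cli download derek-thomas/ScienceQA --repo-type dataset --local-dir scienceqa --local-dir-use-symlinks False
```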

demo_images/av.png

File mode changed from 100755 to 100644.

demo_trt_llm/README.md

Lines changed: 121 additions & 0 deletions

@@ -0,0 +1,121 @@

# Run VILA demo on x86_64 machine

## Build TensorRT-LLM

The first step to build TensorRT-LLM is to fetch the sources:

```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout 66ef1df492f7bc9c8eeb01d7e14db01838e3f0bd
git submodule update --init --recursive
git lfs pull
```

Create the TensorRT-LLM Docker image (building the image requires approximately 63 GB of disk space):

```bash
make -C docker release_build
```
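
The steps above build the image but do not show how to start a container from it. Based on the targets in TensorRT-LLM's `docker/Makefile` (verify against the commit you checked out), one way to launch the freshly built image is:

```bash
# Start an interactive container from the image built by `make -C docker release_build`.
# The target name comes from TensorRT-LLM's docker/Makefile; confirm it exists in your checkout.
make -C docker release_run
```
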
After launching the Docker container, please install the following dependencies:

```bash
pip install git+https://github.com/bfshi/scaling_on_scales.git
pip install git+https://github.com/huggingface/[email protected]
```

## Build TensorRT engines for the VILA model

### For VILA 1.0

Please refer to the [documentation from TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava-and-vila) to deploy the model.

### For VILA 1.5

1. Setup

```bash
# clone VILA
git clone https://github.com/Efficient-Large-Model/VILA.git

# enter the demo folder
cd <VILA-repo>/demo_trt_llm

# apply patch to /usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py for vila1.5
sh apply_patch.sh

# download the VILA checkpoint
export MODEL_NAME="vila1.5-2.7b"
git clone https://huggingface.co/Efficient-Large-Model/${MODEL_NAME} tmp/hf_models/${MODEL_NAME}
```

2. Build the TensorRT engine using `FP16` and run inference

Build the TensorRT engine for the LLaMA part of VILA from the HF checkpoint using `FP16`:

```bash
python convert_checkpoint.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --dtype float16

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/fp16/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --gemm_plugin float16 \
    --use_fused_mlp \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 4096
```

3. Build TensorRT engines for the visual components

```bash
python build_visual_engine.py --model_path tmp/hf_models/${MODEL_NAME} --model_type vila --vila_path ../
```

4. Run the example script

```bash
python run.py \
    --max_new_tokens 100 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/fp16/1-gpu \
    --image_file=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
    --run_profiling

# example output:
...
[Q] <image>\n<image>\n Please elaborate what you see in the images?
[04/30/2024-21:32:11] [TRT-LLM] [I]
[A] ['The first image shows a busy street scene with a car driving through a crosswalk. There are several people walking on the sidewalk, and a cyclist is also visible. The second image captures a beautiful sunset with the iconic Merlion statue spouting water into the water body in the foreground. The Merlion statue is a famous landmark in Singapore, and the water spout is a popular feature of the statue.']
...
```

5. (Optional) You can also use VILA with other quantization options supported by LLaMA, such as SmoothQuant and INT4 AWQ. The instructions in the LLaMA README for enabling SmoothQuant and INT4 AWQ can be reused to generate quantized TRT engines for the LLM component of VILA.

```bash
python quantization/quantize.py \
    --model_dir tmp/hf_models/${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
    --dtype float16 \
    --qformat int4_awq \
    --calib_size 32

trtllm-build \
    --checkpoint_dir tmp/trt_models/${MODEL_NAME}/int4_awq/1-gpu \
    --output_dir trt_engines/${MODEL_NAME}/int4_awq/1-gpu \
    --gemm_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_multimodal_len 4096

python run.py \
    --max_new_tokens 100 \
    --hf_model_dir tmp/hf_models/${MODEL_NAME} \
    --visual_engine_dir visual_engines/${MODEL_NAME} \
    --llm_engine_dir trt_engines/${MODEL_NAME}/int4_awq/1-gpu \
    --image_file=av.png,https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/merlion.png \
    --input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
    --run_profiling
```

demo_trt_llm/apply_patch.sh

Lines changed: 14 additions & 0 deletions

@@ -0,0 +1,14 @@

```bash
#!/bin/bash

# Define the file to be modified
FILE_PATH="/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/convert.py"

# Back up the original file before modification
cp "$FILE_PATH" "${FILE_PATH}.bak"

# Replace the strings (comment out the VILA-specific model registration in convert.py)
# sed -i ':a;N;$!ba;s|hf_config = LlavaConfig.from_pretrained(hf_model).text_config|hf_config = LlavaConfig.from_pretrained(hf_model).text_config\n if hf_config.model_type == "llava_llama":\n hf_config.llm_cfg["architecture"] = hf_config.llm_cfg["architectures"]\n hf_config.llm_cfg["dtype"] = hf_config.llm_cfg["torch_dtype"]\n hf_config = PretrainedConfig.from_dict(hf_config.llm_cfg)|g' $FILE_PATH
sed -i ':a;N;$!ba;s|if "vila" in model_dir:\n sys.path.append(model_dir + "/../VILA")\n from llava.model import LlavaConfig, LlavaLlamaForCausalLM\n AutoConfig.register("llava_llama", LlavaConfig)\n AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)|# if "vila" in model_dir:\n# sys.path.append(model_dir + "/../VILA")\n# from llava.model import LlavaConfig, LlavaLlamaForCausalLM\n# AutoConfig.register("llava_llama", LlavaConfig)\n# AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)|g' $FILE_PATH

# Inform the user
echo "Replacement done. Original file backed up as ${FILE_PATH}.bak"
```

demo_trt_llm/av.png

375 KB
