We use [LLaVA-CC3M-Pretrain-595K](https://huggingface.co/datasets/liuhaotian/LLaVA-CC3M-Pretrain-595K/blob/main/chat.json) to train the visual language projector.
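For reference, a minimal way to pull the annotation file is via the Hugging Face CLI (a sketch; the image archive hosted in the same dataset repo, e.g. `images.zip`, also needs to be downloaded and unpacked, and the local directory name is arbitrary):

```bash
# Download chat.json from the dataset repo referenced above.
# The images (e.g. images.zip in the same repo) must be fetched and extracted separately.
huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K chat.json \
    --repo-type dataset --local-dir llava-cc3m-pretrain-595k
```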
### MMC4-Core Dataset
Due to compute limits, we pre-train VILA on the smaller core set of MMC4 instead of the full set.

1. First, download the annotations of the MMC4-core dataset here: https://github.com/allenai/mmc4. We used the non-fewer-face split, and you may need to request access [here](https://forms.gle/VYtcNY8aYaUANK9f8).

2. Now modify the input and output paths in `mmc4_downloader.py` and run the following script to crawl the MMC4 images:

```bash
cd mmc4
python mmc4_downloader.py
```

Note that due to the expiration of image URLs, you may end up with only a subset of the entire corpus.

The crawling may take a long time. Optionally, you can also shard the workload over multiple jobs/machines running concurrently to speed up the process.
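A rough sketch of one way to do this is below; the two positional shard-range arguments passed to `mmc4_downloader.py` are an assumption about its CLI, so adapt them to the script's actual interface:

```bash
# Sketch only: run several downloader jobs in parallel, each over a slice of
# the annotation shards. The two positional shard indices are hypothetical.
cd mmc4
for JOB in {0..3}; do
    python mmc4_downloader.py $((JOB * 8)) $(((JOB + 1) * 8)) &
done
wait
```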
### COYO-700M Dataset

2. Crawl the COYO images. Note that here we only keep the 20% subset of each shard with the highest CLIP similarity, to balance compute budget and data quality.

There are 128 shards of annotations in total. Now download each one with the script:
```bash
cd coyo
for SHARD in {0..127}; do
    python coyo_downloader.py $SHARD
done
```
3. Split downloaded COYO data into multiple shards:

```bash
python coyo_splitter.py
```

### LLaVA-1.5 Instruction Data

We use this [file](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/blob/main/llava_v1_5_mix665k.json) in our experiments. Please download this dataset from the LLaVA authors.
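For example, the annotation file can be fetched with the Hugging Face CLI (a sketch; the associated image sets, e.g. COCO, GQA, OCR-VQA, TextVQA, and Visual Genome, must be downloaded separately following the LLaVA instructions):

```bash
# Pull the mixture annotation file referenced above
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json \
    --repo-type dataset --local-dir llava-instruct
```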
You can follow this [page](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets) to prepare the data mixture proposed by LLaVA-NeXT.

### Shot2story

Please follow this [page](https://github.com/bytedance/Shot2Story/blob/master/DATA.md) to download the videos. The JSON file can be downloaded with
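Presumably the JSON lives in the same `mit-han-lab/vila-dataset` Hugging Face repo used elsewhere in this document; the exact filename below is an assumption, so check the repo's file listing:

```bash
# Assumed filename; verify against the mit-han-lab/vila-dataset file list
huggingface-cli download mit-han-lab/vila-dataset shot2story_shotonly.json --repo-type dataset --local-dir shot2story --local-dir-use-symlinks False
```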
### ShareGPT4V

The ShareGPT4V data can be obtained from [mit-han-lab/ShareGPT4V](https://huggingface.co/datasets/mit-han-lab/ShareGPT4V).

* Note: the original ShareGPT4V dataset contains some samples with file ids (sa_XXXX) and repetitive responses. We filter those bad examples, reducing the samples from 100K to 96K (for caption) and from 1.2M to 1.17M (for pretraining), and then re-combine them into a single file.
### Video_ChatGPT

You can follow this [page](https://github.com/mbzuai-oryx/Video-ChatGPT/blob/main/README.md#video-instruction-dataset-open_file_folder) to prepare the Video_ChatGPT dataset.

### Youcook2

Please follow this [page](http://youcook2.eecs.umich.edu/) to download the videos. The JSON file can be downloaded with
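Again presumably from `mit-han-lab/vila-dataset`, following the same pattern as the other datasets here; the filename is an assumption:

```bash
# Assumed filename; verify against the mit-han-lab/vila-dataset file list
huggingface-cli download mit-han-lab/vila-dataset youcook_filtered_v3.json --repo-type dataset --local-dir youcook2 --local-dir-use-symlinks False
```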
### ShareGPT_Video

You can follow this [page](https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction) to prepare the ShareGPT_Video dataset.
### WIT

The original WIT data can be obtained from [google-research-datasets/wit](https://github.com/google-research-datasets/wit/tree/main).

* We subsample ~538K English samples from the original WIT dataset and curate a llava conversation format JSON file.
### GSM8K-ScRel

We add some math data from [gsm8k-ScRel](https://github.com/OFA-Sys/gsm8k-ScRel/blob/main/data/train_use.jsonl) to our SFT stage.
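One way to fetch the linked training file (a sketch using the raw GitHub URL behind the blob link above):

```bash
# Download the gsm8k-ScRel SFT data referenced above
mkdir -p gsm8k-ScRel
wget -O gsm8k-ScRel/train_use.jsonl \
    https://raw.githubusercontent.com/OFA-Sys/gsm8k-ScRel/main/data/train_use.jsonl
```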
### Sherlock

The image files of Sherlock can be obtained from [VisualGenome](https://visualgenome.org/api/v0/api_home.html) and [VCR](https://visualcommonsense.com/download/) separately. The llava conversation format JSON file can be downloaded with
```bash
huggingface-cli download mit-han-lab/vila-dataset sherlock_317k.json --repo-type dataset --local-dir sherlock --local-dir-use-symlinks False
```

### ScienceQA

We use the train split of ScienceQA. The image data of the train split can be obtained from the [ScienceQA huggingface repo](https://huggingface.co/datasets/derek-thomas/ScienceQA). The llava conversation format JSON file can be downloaded with
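Likely this follows the same pattern as Sherlock above; the JSON filename is an assumption, so confirm it in the `mit-han-lab/vila-dataset` repo:

```bash
# Assumed filename; verify against the mit-han-lab/vila-dataset file list
huggingface-cli download mit-han-lab/vila-dataset scienceqa_train_12k.json --repo-type dataset --local-dir scienceqa --local-dir-use-symlinks False
```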
### Deployment with TRT-LLM

Please refer to the [documentation from TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/multimodal#llava-and-vila) to deploy the model.
```bash
--input_text="<image>\n<image>\n Please elaborate what you see in the images?" \
--run_profiling

# example output:
...
[Q] <image>\n<image>\n Please elaborate what you see in the images?
[04/30/2024-21:32:11] [TRT-LLM] [I]
[A] ['The first image shows a busy street scene with a car driving through a crosswalk. There are several people walking on the sidewalk, and a cyclist is also visible. The second image captures a beautiful sunset with the iconic Merlion statue spouting water into the water body in the foreground. The Merlion statue is a famous landmark in Singapore, and the water spout is a popular feature of the statue.']
...
```
5. (Optional) One can also use VILA with other quantization options supported for LLaMA, such as SmoothQuant and INT4 AWQ. The instructions in the LLaMA README for enabling SmoothQuant and INT4 AWQ can be reused to generate quantized TRT engines for the LLM component of VILA.