-
Thank you for your great work, this is really cool! I just turned on the Discussions tab, which should be a better home for valuable discussions like this one. So, it seems that with 4-/8-bit quantization, LLaVA can fit into a GPU with 24GB of memory, or possibly less. I am working on a LLaVA version based on the latest Vicuna code base, and I am planning to release it next week. I am not very familiar with 4-/8-bit quantization of LLMs yet. Is there anything I should specifically pay attention to? It seems that Vicuna already supports 8-bit quantization? Thank you again!
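(For reference, 8-bit loading of the language-model half through Hugging Face transformers looks roughly like the sketch below. This is a minimal sketch, not LLaVA's actual loading code: the checkpoint path is a placeholder, and the CLIP tower and projector would still need to be handled separately.)

```python
# Minimal sketch of 8-bit loading via bitsandbytes, assuming a standard
# Hugging Face causal-LM checkpoint. The model path is a placeholder, and
# LLaVA's CLIP tower and projector are not covered here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/llava-13b"  # placeholder checkpoint path

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,        # needs `pip install bitsandbytes accelerate`
    torch_dtype=torch.float16,
    device_map="auto",        # let accelerate place the layers on the GPU
)
```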
-
I just did a 4-bit quantization with GPTQ and loaded the model in text-generation-webui (currently taking up 10.2GB of VRAM, running on a single RTX 3060). EDIT: I just needed to open the webui in chat mode; now it's working!
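(For anyone wanting to reproduce this, a hedged sketch of 4-bit GPTQ quantization using the AutoGPTQ library follows. The comment above likely used different GPTQ tooling; the paths and the calibration text here are placeholders, and real runs use a few hundred calibration samples.)

```python
# Hedged sketch: 4-bit GPTQ quantization with AutoGPTQ. Paths and the
# calibration text are placeholders, not values from the comment above.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_name = "path/to/llava-13b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # common GPTQ group size
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config)

# Calibration data: tokenized examples with input_ids/attention_mask.
examples = [tokenizer("A photo of a cat sitting on a laptop keyboard.",
                      return_tensors="pt")]
model.quantize(examples)
model.save_quantized("llava-13b-4bit-128g")  # placeholder output dir
```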
-
Can confirm, it works on a 12GB GPU with 4-bit quantization. It's taking up around 10.2GB; I might go OOM if I give it too large a context, but it runs.
-
Hi @Wojtab, thank you for your great contribution. I got a chance to try it out today, and it runs with amazingly low VRAM usage! Two things I noticed:
Thank you!
-
@Wojtab Great, thank you so much for your contribution! We have just released our 7B checkpoint. I noticed that the `<im_start>` and `<im_end>` token indices change due to the different base checkpoint. Do you think it is easy to integrate that into the model? If not, we could hack the checkpoint a bit by adding dummy tokens, as in the sketch below.
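(A hedged sketch of that dummy-token idea, assuming standard transformers APIs; the checkpoint path and the filler count are placeholders, not values from either release.)

```python
# Hedged sketch: pad the vocabulary with filler tokens so that <im_start>
# and <im_end> land at the indices the released projector weights expect.
# NUM_FILLER is a placeholder; it depends on the two base checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/llava-7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

NUM_FILLER = 2  # placeholder: however many slots the index gap requires
tokenizer.add_tokens([f"<dummy_{i}>" for i in range(NUM_FILLER)])
tokenizer.add_tokens(["<im_patch>", "<im_start>", "<im_end>"],
                     special_tokens=True)

# Grow the embedding matrix to cover the new vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
```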
-
Hi, I am trying to train/finetune the base 7B model for LLaVA. The README says it's possible (see [7/19] under Release), but I can't seem to find an example.
-
There is no discussions tab, so I'm opening this as an issue.
I made it work on a single 3090 in ooba's webui; see this PR for more info: oobabooga/text-generation-webui#1487.
There is even a small chance it will run on 12GB GPUs, since 4-bit Vicuna-13B fits; the question is whether it still fits with CLIP and the projector loaded.