Why do vision models need specific surgery files? #11139
-
I could be mistaken here, but I was under the impression that it would eventually be possible to convert a model containing both a vision encoder and a language model into a single .gguf file, and that once the new vision API is in place, a single .gguf could hold both. This is what I did for Llama 3.2 Vision in this branch. So if the above is true, it might make sense to hold off on moving this to |
-
@bartowski1182 Sorry for misleading you on this; it turned out my assumptions were incorrect. So I think your idea of adding a |
-
To answer your original question, @bartowski1182, the reason we need these "surgery" scripts is that, historically, the llava example was developed outside of the main llama.cpp and
Of course, we can modify |
-
For example, take the recent Qwen2VL implementation. Everything in `qwen2_vl_surgery.py` is just Python code; is there a reason it couldn't be added to `convert_hf_to_gguf.py`? We already detect that the architecture is qwen2vl, so it seems simple enough to move all the surgery code into that block and do both in one pass, especially if we add an optional `--vision-adapter` param that's ignored for non-vision models and, for vision ones, specifies that the vision adapter should be created. I ask because I was considering making the change, but if there's a specific reason it hasn't been done that way, I won't bother. Maybe I'm missing something.
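A minimal sketch of how the proposed flag could be dispatched, assuming an argparse front end like the one a conversion script would use. `--vision-adapter`, `VISION_ARCHITECTURES`, `plan_outputs`, and the output file names are hypothetical and are not existing `convert_hf_to_gguf.py` options; the sketch only shows the "ignored for text-only models" logic, not the actual tensor extraction or GGUF writing.

```python
import argparse

# Hypothetical set of architectures that ship a vision encoder next to the LM.
# In a real converter this would be matched against "architectures" in config.json.
VISION_ARCHITECTURES = {"Qwen2VLForConditionalGeneration"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a combined HF -> GGUF converter")
    parser.add_argument("model_dir", help="path to the Hugging Face model directory")
    # Hypothetical flag proposed in this thread; not an existing convert_hf_to_gguf.py option.
    parser.add_argument(
        "--vision-adapter",
        action="store_true",
        help="also export the vision encoder (ignored for text-only models)",
    )
    return parser


def plan_outputs(architecture: str, vision_adapter: bool) -> list[str]:
    """Return the (hypothetical) GGUF files one conversion run would produce."""
    outputs = ["language-model.gguf"]
    # The flag only has an effect when the detected architecture has a vision tower.
    if vision_adapter and architecture in VISION_ARCHITECTURES:
        outputs.append("vision-adapter.gguf")
    return outputs


# Usage with an explicit argv list; the architecture is hard-coded for illustration.
args = build_parser().parse_args(["./Qwen2-VL-2B-Instruct", "--vision-adapter"])
print(plan_outputs("Qwen2VLForConditionalGeneration", args.vision_adapter))
```

Keeping the flag a no-op for text-only architectures means existing invocations of the script would behave exactly as before, which is the main argument for folding the surgery step into the main converter rather than keeping a separate script.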