I have a visual question answering task that I want to train a VLM for using SFT. I want to train the VLM only on the completions and not on the prompt itself.
How do I use the SFTTrainer for that? For text-only tasks, I can use the prompt-completion dataset type and offload everything to the SFTTrainer. Is that also possible for multimodal datasets and VLMs? I went through the source code and I believe it should work fine, but I am wondering whether that is actually the case.
Is it possible to avoid writing the collator? I believe this is a huge can of worms that I want to avoid as much as possible.
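To be concrete about what I mean by training only on the completions, here is a minimal sketch of the label masking I would expect to happen somewhere in the pipeline. It is plain Python and not TRL's actual implementation: `mask_prompt_labels` is a hypothetical helper and the token ids are made up. The only fact it relies on is that `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. For a VLM, I assume the image placeholder tokens would sit in the prompt part and be masked the same way.

```python
# Sketch of completion-only loss masking (hypothetical helper, not TRL code).
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion; exclude prompt tokens from the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    # Prompt positions get IGNORE_INDEX so they contribute nothing to the loss;
    # completion positions keep their token ids as targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


# Made-up token ids: 4 prompt tokens followed by 3 completion tokens.
example = mask_prompt_labels([101, 7, 8, 9], [42, 43, 2])
print(example["labels"])
```

What I would like to avoid is having to reproduce this masking myself in a custom collator for image inputs.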