-
Notifications
You must be signed in to change notification settings - Fork 32
Description
Hello, I'm not very familiar with the transformers library and it took me several days to barely understand the code. I noticed that the model's causal_mask is None while fine-tuning a text-to-image generation task — isn't that abnormal? Is it because the --unmask_image_logits argument was set in the public training script(exps/7B.sh)? If I want to fine-tune the text-to-image generation task, should I remove this argument?
Additionally, i make dataset by train.md, such as:
[ { "conversations":[ { "from": "human", "value": "Generate an image of 768x768 according to the following prompt:\n a dog." }, { "from": "gpt", "value": "<|image|>" } ], "image": ["./00.jpg"] }, { "conversations":[ { "from": "human", "value": "Generate an image of 768x768 according to the following prompt:\n a cat." }, { "from": "gpt", "value": "<|image|>" } ], "image": ["./01.jpg"] }, { "conversations":[ { "from": "human", "value": "Generate an image of 768x768 according to the following prompt:\n a horse." }, { "from": "gpt", "value": "<|image|>" } ], "image": ["./02.jpg"] } ]
Another issue I've observed is that Lumina_mgpt relies on the input prompt during inference to determine whether to generate image tokens or text tokens. However, I found that without using the specific template mentioned in the paper — such as “Generate an image of 1024x1024 according to the following prompt:...”, the model sometimes fails to generate images, e.g., when using a simpler prompt like “generate an image of dog.” This behavior seems unusual. Why should the decision between image and text generation be manually controlled by a FLAG rather than being handled automatically by the model?
I am really looking forward to your reply. Thank you!!