Qusetion of training setting for text-to-image finetuning

Hello, I'm not very familiar with the transformers library and it took me several days to barely understand the code. I noticed that the model's causal_mask is None while fine-tuning a text-to-image generation task — isn't that abnormal? Is it because the --unmask_image_logits argument was set in the public training script(exps/7B.sh)? If I want to fine-tune the text-to-image generation task, should I remove this argument?

Additionally, i make dataset by train.md, such as:
`[
    {
        "conversations":[
            {
                "from": "human",
                "value":  "Generate an image of 768x768 according to the following prompt:\n a dog."
            },
            {
                "from": "gpt",
                "value": "<|image|>"
            }
        ],
        "image": ["./00.jpg"]
    },
    {
        "conversations":[
            {
                "from": "human",
                "value":  "Generate an image of 768x768 according to the following prompt:\n a cat."
            },
            {
                "from": "gpt",
                "value": "<|image|>"
            }
        ],
        "image": ["./01.jpg"]
    },
        {
        "conversations":[
            {
                "from": "human",
                "value":  "Generate an image of 768x768 according to the following prompt:\n a horse."
            },
            {
                "from": "gpt",
                "value": "<|image|>"
            }
        ],
        "image": ["./02.jpg"]
    }
]`

Another issue I've observed is that Lumina_mgpt relies on the input prompt during inference to determine whether to generate image tokens or text tokens. However, I found that without using the specific template mentioned in the paper — such as “Generate an image of 1024x1024 according to the following prompt:...”, the model sometimes fails to generate images, e.g., when using a simpler prompt like “generate an image of dog.” This behavior seems unusual. Why should the decision between image and text generation be manually controlled by a FLAG rather than being handled automatically by the model?

I am really looking forward to your reply. Thank you!!




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Qusetion of training setting for text-to-image finetuning #42

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Qusetion of training setting for text-to-image finetuning #42

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions