I am trying to SFT-train the Llama 3.2 11B Vision Instruct model on a dataset where each question about an image is answered using a context (which could include more than one image). My code is:
def format_data(sample):
    # Load images from the sample
    images = load_images(sample.get("image", []))
    # Use the first image as the question image
    q_image = images[0]
    conversations = sample.get("conversations", [])
    # Extract the answer
    answer = next((conv["value"] for conv in conversations if conv.get("from") == "gpt"), "no answer")
    # NOTE: `question` and `context` were not defined in the original snippet.
    # The "human" role and the "context" key below are assumptions about the dataset schema.
    question = next((conv["value"] for conv in conversations if conv.get("from") == "human"), "")
    context = sample.get("context", "")
    # Initial prompt defining the task and giving the model the context passage
    instruction_prompt_template = '''
You are a helpful assistant tasked with answering questions from a given multimodal context (images and texts). Please infer the answer from the context and respond.
Context: {context}'''
    # Prepare the messages array for the model input
    messages = [{"role": "user", "content": []}]
    messages[0]["content"].append({"type": "text", "text": instruction_prompt_template.format(context=context)})
    messages[0]["content"].append({"type": "image", "image": q_image})
    messages[0]["content"].append({"type": "text", "text": question})
    sample_conversation = tokenizer.apply_chat_template(messages, tokenize=False)
    return {"text": sample_conversation, "messages": messages, "answer": answer}
I am also trying to define a collator function for the SFT trainer.
My first question: when I prepare the text column, should I format it like this?
<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful assistant tasked with answering questions from a given context. Please infer the answer from the context and respond. Context: Most fossils are preserved by one of five processes outlined below (Figure 1.1): 1. What is the traditional definition of gravity? 2. Identify factors that influence the strength of gravity between two objects. Despite these problems, there is a rich fossil record. How does an organism become fossilized? <|image|> How many actions are depicted in the diagram?<|eot_id|> <|start_header_id|>assistant<|end_header_id|>7<|eot_id|>
That is, should I place the <|image|> placeholder in the text, or should I insert the actual image, or its path?
My second question: I am not sure how to define the collator function. I am getting a training loss of zero at every step, and I think this is because the loss is being calculated over the whole sequence rather than only the response. How can I define this collator?
Thank you in advance.
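For the collator question, a common approach is to copy `input_ids` into `labels` and then set every position that should not contribute to the loss (the prompt and the padding) to -100, which is the default `ignore_index` of PyTorch's cross-entropy loss. The following is only a minimal sketch of that masking step on plain Python lists. The function names and the toy token IDs are hypothetical, and in a real collator the assistant-header IDs and the pad ID would come from the model's tokenizer.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def find_last_subsequence(sequence, pattern):
    """Return the start index of the last occurrence of pattern in sequence, or -1."""
    for start in range(len(sequence) - len(pattern), -1, -1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_labels(input_ids, assistant_header_ids, pad_token_id):
    """Build labels that keep only the tokens after the last assistant header."""
    labels = list(input_ids)
    header_start = find_last_subsequence(input_ids, assistant_header_ids)
    if header_start == -1:
        # No assistant turn found: mask everything so this row adds no loss.
        return [IGNORE_INDEX] * len(labels)
    answer_start = header_start + len(assistant_header_ids)
    for i, token_id in enumerate(input_ids):
        # Mask the prompt (everything up to and including the header) and padding.
        # Caveat: if the pad ID equals the end-of-turn ID, this also masks that token.
        if i < answer_start or token_id == pad_token_id:
            labels[i] = IGNORE_INDEX
    return labels


# Toy example with made-up IDs: [7, 8, 9] stands in for the assistant header.
toy_input_ids = [1, 20, 21, 22, 7, 8, 9, 30, 2, 0, 0]
toy_labels = mask_prompt_labels(toy_input_ids, assistant_header_ids=[7, 8, 9], pad_token_id=0)
print(toy_labels)
```

In a real collator this would be applied to each row of the batch, with the result stored under `labels`. Image placeholder tokens are usually masked in the same way. It is also worth checking that at least some labels in each batch are not -100, and that the assistant answer is actually present in the text being tokenized.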
Please check out the prompt format guide in the documentation from here. The placement of the <|image|> tag is important: the text prompt should always come after the image tag, not before it. The image itself is not part of the prompt. You can learn more about how images are handled in Llama from here.
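To illustrate the ordering advice, here is a small sketch of a message list in which the image entry comes before all of the text in the user turn, and the answer is added as an assistant turn. The dictionary layout follows the multimodal chat-message convention used by Hugging Face chat templates, which is an assumption here, and `build_messages` is a hypothetical helper name. The image itself would typically be passed to the processor separately rather than embedded in the text.

```python
def build_messages(context, question, answer):
    """Return chat messages with the image placeholder ahead of all text."""
    instruction = (
        "You are a helpful assistant tasked with answering questions from a "
        "given multimodal context (images and texts). Please infer the answer "
        "from the context and respond."
    )
    user_text = f"{instruction}\nContext: {context}\n{question}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},  # image placeholder first
                {"type": "text", "text": user_text},  # all text after the image
            ],
        },
        # Including the assistant turn gives the trainer a target to learn from.
        {"role": "assistant", "content": [{"type": "text", "text": answer}]},
    ]


example_messages = build_messages(
    context="Most fossils are preserved by one of five processes.",
    question="How many actions are depicted in the diagram?",
    answer="7",
)
print(example_messages[0]["content"][0])
```

With this layout, applying the chat template should emit the <|image|> tag ahead of the instruction, context and question, which matches the ordering described above.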