This repository provides baseline implementations for the ImageCLEF 2025 Multimodal Reasoning - Visual Question Answering (VQA) task. The baselines use vision-language models (VLMs) and large language models (LLMs) to solve image-based multiple-choice questions in a zero-shot setting.
Given an image of a multiple-choice question (MCQ), the task is to:
- Identify the question and all answer options from the image or extracted caption.
- Understand any relevant visual content (e.g., graphs, tables).
- Predict the correct answer based only on the provided image or caption.
The submission file MUST have the following format:
- `id`: Unique identifier (matching a sample from the Test set).
- `answer_key`: Predicted answer label (one of "A", "B", "C", "D", or "E").
- `language`: Question language.
Additional formatting rules:
- Submission MUST be the same size as the Test set. For single-language submissions, we expect the size to match the respective test data for that language.
- Submission MUST NOT contain duplicates (otherwise there will be an evaluation error!).
- `answer_key` must be EXACTLY ONE of "A", "B", "C", "D", or "E".
Correct submission file example:
[
{
"id": "5e9sf6b9-3338-4e97-ba6b-762e24a07e69",
"answer_key": "A",
"language": "English"
},
{
"id": "08fjguy8-4e97-12s4-bt65-385f09dsk5df",
"answer_key": "C",
"language": "English"
},
...
]
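Before submitting, it can help to check these rules locally. The sketch below is illustrative only; the helper name and the way the expected size is passed are assumptions, not part of this repository:

```python
import json
import sys

VALID_KEYS = {"A", "B", "C", "D", "E"}

def check_submission(pred_path: str, expected_size: int) -> None:
    """Check the basic formatting rules listed above (size, duplicates, labels)."""
    with open(pred_path, encoding="utf-8") as f:
        preds = json.load(f)

    ids = [p["id"] for p in preds]
    assert len(preds) == expected_size, f"expected {expected_size} entries, got {len(preds)}"
    assert len(ids) == len(set(ids)), "duplicate ids found"
    assert all(p["answer_key"] in VALID_KEYS for p in preds), "answer_key must be one of A-E"
    print("Submission file passes the basic format checks.")

if __name__ == "__main__":
    check_submission(sys.argv[1], int(sys.argv[2]))
```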
The evaluation metric for the task is accuracy: correct / total_questions.
We provide an evaluation script that you can use locally, located at `evaluation/evaluate.py`.
Example usage:
python evaluate.py --pred_file="./pred.json" --gold_file="./gold.json" --print_score="True"
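The scoring itself amounts to matching predictions to gold answers by `id` and dividing the number of matches by the total. A minimal sketch of that logic (not the official `evaluate.py`):

```python
import json

def accuracy(pred_file: str, gold_file: str) -> float:
    # Index both files by question id so entry order does not matter.
    with open(pred_file, encoding="utf-8") as f:
        preds = {p["id"]: p["answer_key"] for p in json.load(f)}
    with open(gold_file, encoding="utf-8") as f:
        gold = {g["id"]: g["answer_key"] for g in json.load(f)}

    correct = sum(1 for qid, key in gold.items() if preds.get(qid) == key)
    return correct / len(gold)

print(f"Accuracy: {accuracy('./pred.json', './gold.json'):.4f}")
```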
We provide two types of baselines.

VLM baselines use the image directly for reasoning in a zero-shot setting:
- Molmo
  - Model path: `models/molmo`
  - Script: `baselines/molmo.py`
- SmolVLM
  - Model path: `models/smolvlm`
  - Script: `baselines/smolvlm.py`

LLM baselines use precomputed image captions for reasoning:
- OLMo
  - Model path: `models/olmo`
  - Captions: `captions/Llama-3.2-11B-Vision/` or `captions/SmolVLM/`
  - Script: `baselines/olmo.py`
- SmolLM
  - Model path: `models/smollm`
  - Captions: `captions/Llama-3.2-11B-Vision/` or `captions/SmolVLM/`
  - Script: `baselines/smollm.py`
All models are evaluated in a zero-shot setting with no fine-tuning.
Each baseline uses a specific zero-shot prompt to guide reasoning:
- Prompt 1: A short, direct instruction for selecting the correct answer based on image or caption content only.
- Prompt 2: A step-by-step reasoning prompt encouraging deeper analysis of textual and visual cues (including multilingual content).
The VLM baselines use the image as input; examples include `molmo.py` and `smolvlm.py`.
Prompt 1:
Analyze the image of a multiple-choice question. Identify the question, all answer options (even if there are more than four), and any relevant visuals like graphs or tables. Choose the correct answer based only on the image. Reply with just the letter of the correct option, no explanation.
Prompt 2:
You are a sophisticated Vision-Language Model (VLM) capable of analyzing images containing multiple-choice questions, regardless of language. To guide your analysis, you may adopt the following process:
- Examine the image carefully for all textual and visual information.
- Identify the question text, even if it's in a different language.
- Extract all answer options (note: there may be more than four).
- Look for additional visual elements such as tables, diagrams, charts, or graphs.
- Ensure to consider any multilingual content present in the image.
- Analyze the complete context and data provided.
- Select the correct answer(s) based solely on your analysis.
- Respond by outputting only the corresponding letter(s) without any extra explanation.
To query a Vision-Language Model (VLM) with an image, follow these steps:

- Convert the image to base64 format:
  - Open the image file (e.g., `.png`) in binary mode.
  - Encode the binary data using `base64.b64encode(...)`.
  - Prefix the encoded string with `data:image/png;base64,` to make it web-compatible.
- Format the input as an OpenAI-compatible chat message:
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "<insert_prompt_text_here (Prompt 1 or Prompt 2)>"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/png;base64,<base64_encoded_image>"
      }
    }
  ]
}
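A minimal end-to-end sketch of these two steps, using the OpenAI Python client against a locally served vLLM model (the server URL, port, model name, and image path below are assumptions for illustration, not taken verbatim from `molmo.py`):

```python
import base64
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def encode_image(path: str) -> str:
    # Read the image in binary mode and build a web-compatible data URL.
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{b64}"

prompt_text = (
    "Analyze the image of a multiple-choice question. Identify the question, "
    "all answer options (even if there are more than four), and any relevant "
    "visuals like graphs or tables. Choose the correct answer based only on "
    "the image. Reply with just the letter of the correct option, no explanation."
)

response = client.chat.completions.create(
    model="models/molmo",  # path of the locally downloaded model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt_text},
            {"type": "image_url", "image_url": {"url": encode_image("data/images/example.png")}},
        ],
    }],
    extra_body={"guided_choice": ["A", "B", "C", "D", "E"]},  # vLLM guided decoding
)
print(response.choices[0].message.content)  # e.g. "A"
```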
The LLM baselines use precomputed captions as input; examples include `olmo.py` and `smollm.py`.
Prompt 1:
You are given a multiple-choice question extracted from an exam. The question is:
{caption}
Identify the question and all answer options (even if there are more than four), and any relevant data related to graphs or tables. Choose the correct answer and reply with just the letter of the correct option, no explanation.
Prompt 2:
You are given a multiple-choice question extracted from an exam.
The question is: {caption}
Please follow the steps below to determine the correct answer:
- Carefully read and interpret the full question text.
- Identify the main question, even if it is in a different language.
- Extract all available answer options (note: there may be more than four).
- Pay attention to any references to data, including tables, diagrams, charts, or graphs mentioned in the text.
- Take into account any multilingual elements present in the question.
- Analyze all information in context, both textual and inferred data.
- Select the correct answer based solely on your analysis.
- Respond by outputting only the letter(s) of the correct answer option, with no additional explanation.
Format the input as an OpenAI-compatible chat message:
{
"role": "user",
"content": prompt_text
}
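A comparable sketch for the caption-based baselines, where the message content is plain text (the caption string, server URL, and model name are placeholders, not the exact code in `olmo.py`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# The caption would normally be read from the precomputed files under captions/;
# a literal string is used here to keep the sketch self-contained.
caption = "Which of the following ...? A) ... B) ... C) ... D) ..."

prompt_text = (
    "You are given a multiple-choice question extracted from an exam. "
    f"The question is: {caption} "
    "Identify the question and all answer options (even if there are more than four), "
    "and any relevant data related to graphs or tables. Choose the correct answer and "
    "reply with just the letter of the correct option, no explanation."
)

response = client.chat.completions.create(
    model="models/olmo",  # path of the locally downloaded LLM
    messages=[{"role": "user", "content": prompt_text}],
    extra_body={"guided_choice": ["A", "B", "C", "D", "E"]},  # vLLM guided decoding
)
print(response.choices[0].message.content)
```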
ImageCLEF-2025-MultimodalReasoning/
├── baselines/
│   ├── molmo.py                  # VLM (image input)
│   ├── smolvlm.py                # VLM (image input)
│   ├── olmo.py                   # LLM (caption input)
│   └── smollm.py                 # LLM (caption input)
├── scripts/
│   ├── molmo.sh                  # Launch molmo baseline
│   ├── smolvlm.sh                # Launch smolvlm baseline
│   ├── olmo.sh                   # Launch olmo baseline
│   └── smollm.sh                 # Launch smollm baseline
├── captions/
│   ├── Llama-3.2-11B-Vision/     # Precomputed captions for olmo.py
│   └── SmolVLM/                  # Precomputed captions for smollm.py
├── data/
│   ├── images/                   # MCQ images (.png)
│   └── validation_data.json      # Ground truth JSON
├── models/                       # Downloaded model folders
├── logs/                         # All log and result outputs
└── run.sh                        # Entry point for selected baseline
Install required Python dependencies:
pip install -r requirements.txt
Ensure the `models/` folder contains your downloaded vLLM-compatible models.
Unzip the caption files in `captions/`:
unzip captions/Llama-3.2-11B-Vision.zip -d captions/
unzip captions/SmolVLM.zip -d captions/
Launch a baseline via its script, for example:
bash scripts/molmo.sh
bash scripts/olmo.sh
You can also run `run.sh`, which defaults to a specific baseline script (e.g., olmo).
Each evaluation script produces a JSON file:
[
{
"id": "image_001",
"language": "English",
"answer_key": "C"
},
...
]
The final accuracy is printed and logged in `logs/result_<model>_log.txt`.
- The OpenAI interface is used for vLLM-compatible chat API calls.
- All predictions use a guided choice mechanism: ["A", "B", "C", "D", "E"].
- Prompt 1 is enabled by default (short reasoning prompt). You may uncomment Prompt 2 for more detailed chain-of-thought-style prompting.
- Captions are extracted from vision models and serve as proxies for visual content in LLM pipelines.