This document outlines the proposed codebase structure, directory layout, and key components for running zero-shot and few-shot (with and without chain-of-thought, CoT) experiments on the Gemini 1.5, 2.0, and 2.5 VLMs using the Idioms Rebus Puzzle dataset.
```
vlm_rebus_experiments/
├── .env                      # Environment variables for GCP and API keys
├── README.md                 # Project overview and instructions
├── requirements.txt          # Python dependencies
├── config/                   # Configuration files for experiments
│   ├── base.yaml             # Base settings (dataset paths, logging)
│   ├── gemini1.5.yaml        # Model-specific overrides
│   ├── gemini2.0.yaml
│   └── gemini2.5.yaml
│
├── data/                     # Dataset: raw images & annotations
│   ├── raw/                  # Original puzzle images + annotations
│   │   ├── images/           # Raw puzzle images
│   │   └── annotations.csv   # Ground-truth answers
│   └── load_data.py          # Functions to list and load raw images and annotations
│
├── prompts/                  # Prompt templates and builders
│   ├── templates/            # Jinja2 or .txt templates for each style
│   │   ├── zero_shot.txt
│   │   ├── fewshot2_cot.txt
│   │   ├── fewshot3_cot.txt
│   │   ├── fewshot2_nocot.txt
│   │   └── fewshot3_nocot.txt
│   └── builder.py            # Functions to render prompts given examples
│
├── experiments/              # Experiment orchestration
│   ├── run_experiment.py     # CLI entrypoint to run a single config
│   ├── evaluate.py           # Scoring and metrics
│   └── utils.py              # Helpers (logging, retry, batching)
│
├── models/                   # Model wrappers and clients
│   ├── base_client.py        # Wrapper around google-genai client setup
│   ├── gemini1_5.py          # Instantiates 1.5 model with config
│   ├── gemini2_0.py          # Instantiates 2.0-flash model
│   └── gemini2_5.py          # Instantiates 2.5-pro model
│
├── logs/                     # Generated logs and outputs
│   └── <timestamp>/          # Per-experiment output directories
│       ├── prompts/          # Raw prompts sent
│       ├── responses/        # JSON or text responses
│       └── metrics.json      # Evaluation metrics
│
└── notebooks/                # Jupyter notebooks for analysis
    ├── explore_data.ipynb
    └── compare_results.ipynb
```
The `config/` directory holds YAML files defining experiment settings. We load these configurations at runtime and merge the base file with the model-specific overrides.
- `config/base.yaml`:

  ```yaml
  project: ${GOOGLE_CLOUD_PROJECT}
  location: ${GOOGLE_CLOUD_LOCATION}
  use_vertexai: ${GOOGLE_GENAI_USE_VERTEXAI}

  dataset:
    images_dir: "data/raw/images"
    annotations_file: "data/raw/annotations.csv"
    examples_dir: "data/examples"

  logging:
    level: "INFO"
    dir: "logs"

  prompt_styles:
    - zero_shot
    - fewshot2_cot
    - fewshot3_cot
    - fewshot2_nocot
    - fewshot3_nocot

  request:
    batch_size: 4
    timeout_seconds: 60
  ```

- `config/gemini1.5.yaml`:

  ```yaml
  model:
    name: "gemini-1.5-flash"
    api_type: "studio"
    api_key: ${GEMINI_API_KEY}
    use_vertexai: false
    max_output_tokens: 8192
    supports_cot: true
    context_window: 1048576
  ```

- `config/gemini2.0.yaml`:

  ```yaml
  model:
    name: "projects/${GOOGLE_CLOUD_PROJECT}/locations/${GOOGLE_CLOUD_LOCATION}/publishers/google/models/gemini-2.0-flash-001"
    api_type: "vertex"
    use_vertexai: true
    max_output_tokens: 8192
    supports_cot: true
    context_window: 1048576
  ```

- `config/gemini2.5.yaml`:

  ```yaml
  model:
    name: "projects/${GOOGLE_CLOUD_PROJECT}/locations/${GOOGLE_CLOUD_LOCATION}/publishers/google/models/gemini-2.5-flash-preview-04-17"
    api_type: "vertex"
    use_vertexai: true
    max_output_tokens: 65535
    supports_cot: true
    context_window: 1048576
  ```

Loading the YAML in Python:
```python
import yaml
from pathlib import Path

def load_config(cfg_filename):
    base = yaml.safe_load(Path("config/base.yaml").read_text())
    override = yaml.safe_load(Path(f"config/{cfg_filename}").read_text())
    # Simple shallow merge: top-level override keys replace base keys.
    base.update(override)
    return base
```

- Read all images directly from `dataset.images_dir` and ground truth from `dataset.annotations_file`.
- No preprocessing is performed; raw images are fed to the models at inference time.

`load_data.py` should provide functions to iterate over image paths and retrieve corresponding annotations for evaluation.
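A minimal sketch of what `load_data.py` could provide; the CSV column names `image` and `answer` are assumptions, since the annotation schema is not specified here:

```python
import csv
from pathlib import Path

IMAGES_DIR = Path("data/raw/images")
ANNOTATIONS_FILE = Path("data/raw/annotations.csv")

def iter_image_paths(images_dir=IMAGES_DIR):
    """Yield puzzle image paths in a stable, sorted order."""
    for path in sorted(images_dir.glob("*")):
        if path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
            yield path

def load_annotations(annotations_file=ANNOTATIONS_FILE):
    """Map image filename -> ground-truth answer.

    Assumes the CSV has 'image' and 'answer' columns (hypothetical names).
    """
    with open(annotations_file, newline="", encoding="utf-8") as f:
        return {row["image"]: row["answer"] for row in csv.DictReader(f)}
```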
- Templates in `prompts/templates/` use placeholder tokens for the examples and the target.
- `builder.py` renders a template using Jinja2 (or Python `.format`) with:
  - A list of `examples_count` examples from `dataset.examples_dir` and `demo_answers.yaml`.
  - The target image path.
  - Optionally, "Let's think step by step." appended if CoT is enabled.
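A sketch of the `.format`-based variant of `builder.py` (the `{examples}` and `{target_image}` placeholder names are assumptions; a Jinja2 template would work the same way):

```python
COT_SUFFIX = "Let's think step by step."

def build_prompt(template_text, examples, target_image, use_cot=False):
    """Render a prompt template with few-shot examples and a target image.

    `examples` is a list of (image_path, answer) pairs; the template is
    plain text with {examples} and {target_image} placeholders
    (hypothetical names, not confirmed by the template files themselves).
    """
    rendered_examples = "\n".join(
        f"Image: {img}\nAnswer: {ans}" for img, ans in examples
    )
    prompt = template_text.format(
        examples=rendered_examples, target_image=target_image
    )
    if use_cot:
        prompt = f"{prompt}\n{COT_SUFFIX}"
    return prompt
```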
- `base_client.py`: initializes either a Studio or Vertex AI client based on `config.model.api_type`.
- Each `gemini*.py` exposes a `generate(prompt, image_paths)` function handling API calls with retries, error handling, and logging.
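The retry behavior could live in `experiments/utils.py` as a generic helper; a sketch follows, with `client_call` standing in for the underlying google-genai request (the helper names and signatures here are assumptions, not part of any SDK):

```python
import time

def with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `call()` and retry on exception with exponential backoff.

    The last exception is re-raised once attempts are exhausted.
    `sleep` is injectable so tests can avoid real delays.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

def generate(client_call, prompt, image_paths):
    """Shared shape of each gemini*.py `generate`: wrap the raw API
    call (here abstracted as `client_call`) in the retry helper."""
    return with_retries(lambda: client_call(prompt, image_paths))
```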
CLI usage:

```shell
python run_experiment.py \
    --config gemini2.0.yaml \
    --prompt-style fewshot2_cot \
    --examples-count 2
```

Steps:
- Load the merged config.
- Resolve the `prompt_style`, adjusting for `supports_cot`.
- Iterate over all images in `images_dir`.
- For each image, select `examples_count` demos, build the prompt, call the model, and write results to `logs/<timestamp>`.
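The steps above can be sketched as a single loop; `build_prompt` and `call_model` are injected callables standing in for the prompt builder and model wrapper described earlier (an assumption about how the modules compose):

```python
import json
from datetime import datetime
from pathlib import Path

def run_experiment(build_prompt, call_model, image_paths, log_root="logs"):
    """Iterate images, build prompts, call the model, and log each
    prompt/response pair under logs/<timestamp>/ as outlined above."""
    run_dir = Path(log_root) / datetime.now().strftime("%Y%m%d_%H%M%S")
    (run_dir / "prompts").mkdir(parents=True)
    (run_dir / "responses").mkdir()
    for image_path in image_paths:
        prompt = build_prompt(image_path)
        response = call_model(prompt, image_path)
        stem = Path(image_path).stem
        (run_dir / "prompts" / f"{stem}.txt").write_text(prompt)
        (run_dir / "responses" / f"{stem}.json").write_text(
            json.dumps({"image": str(image_path), "response": response})
        )
    return run_dir
```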
- Load `logs/<timestamp>/responses/` and the corresponding ground truth.
- Compute metrics: exact match, F1, and optionally other scores.
- Output a summary JSON to the same `logs/<timestamp>` directory.
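The two core metrics could be implemented as below; the lowercase-and-collapse-whitespace normalization is one reasonable choice, not a requirement from this outline:

```python
def normalize(text):
    """Lowercase and collapse whitespace (a minimal normalization choice)."""
    return " ".join(text.lower().split())

def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))

def token_f1(prediction, truth):
    """Token-overlap F1, as commonly used for short-answer scoring."""
    pred, gold = normalize(prediction).split(), normalize(truth).split()
    if not pred or not gold:
        return float(pred == gold)
    # Count overlapping tokens, respecting multiplicity.
    common, gold_left = 0, list(gold)
    for tok in pred:
        if tok in gold_left:
            gold_left.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)
```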
- `explore_data.ipynb`: inspect sample images and answer distributions.
- `compare_results.ipynb`: load metrics from multiple runs and plot comparisons.
- Install: `pip install -r requirements.txt`
- Configure: set your `.env` with credentials.
- Run:

  ```shell
  python experiments/run_experiment.py \
      --config gemini2.0.yaml \
      --prompt-style fewshot3_nocot \
      --examples-count 3
  ```

- Evaluate: `python experiments/evaluate.py --timestamp 20250520_120000`
- Analyze: open `notebooks/compare_results.ipynb`.
Use this outline to guide the implementation top to bottom.