This repository contains a compact dump of code, data, and helpers to run, evaluate, and analyze GAIA benchmark submissions. It includes two variants of the GAIA agent (OpenAI‑based and self‑hosted), an insights extraction pipeline for tracing/metrics, and scripts to evaluate JSONL submissions (single‑run accuracy and pass@N).
- `evaluate_gaia.py`: Evaluate a single GAIA‑style submission JSONL against a reference JSONL (validation set) and report accuracy with detailed mismatches.
- `pass_at_n_acc.py`: Compute pass‑at‑N accuracy across multiple submissions with a common prefix (e.g., `my_run_1.jsonl`, `my_run_2.jsonl`, …).
- `Agent217_test_set_submission.jsonl`: Example submission for a private test set generated with pass@N strategies. It cannot be scored without the private references, but it shows the expected JSONL format.
- `GAIA_agent_design/`: Main GAIA agent implementation using the OpenAI Agents SDK (planning → searching → writing → evaluating → judging). See that folder's `AGENT.MD` and `README.md` for details.
- `GAIA_self_hosted_agent/`: Self‑hosted variant targeting a vLLM OpenAI‑compatible endpoint. See `GAIA_self_hosted_agent/AGENT.MD` and `GAIA_self_hosted_agent/README.md` for setup and usage.
- `insights_extraction/`: Pipeline to query Logfire, fetch full traces, generate per‑trace insights (tokens, durations, roles, levels), and produce figures. See `insights_extraction/README.md`.
- `requirements.in` / `requirements.txt`: Python dependencies.
- Python 3.10+ recommended.
- From the repo root: `pip install -r requirements.txt`
`evaluate_gaia.py` compares your submission JSONL to a GAIA validation/reference JSONL that contains the ground‑truth answers. The scorer handles GAIA's comparison rules (numeric normalization, list answers, punctuation/whitespace insensitivity).
Usage:
```bash
python evaluate_gaia.py SUBMISSION_JSONL REFERENCE_JSONL
```

Example (using the bundled GAIA validation metadata):

```bash
python evaluate_gaia.py my_submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl
```

Output: overall accuracy plus a list of mismatches with task IDs, questions, predicted answers, and references.
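For intuition about what counts as correct, the comparison roughly normalizes both strings before checking equality. The sketch below is illustrative only, not the exact logic in `evaluate_gaia.py`:

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, canonicalize numbers, and strip punctuation/extra whitespace."""
    s = ans.strip().lower()
    try:
        # Pure numbers compare by value, so "1,000" matches "1000.0".
        return str(float(s.replace(",", "")))
    except ValueError:
        pass
    s = re.sub(r"[^\w\s]", "", s)   # drop punctuation
    return re.sub(r"\s+", " ", s)   # collapse whitespace

def is_correct(pred: str, ref: str) -> bool:
    # Comma-separated references are treated as list answers and
    # compared element by element.
    if "," in ref:
        return [normalize(p) for p in pred.split(",")] == \
               [normalize(r) for r in ref.split(",")]
    return normalize(pred) == normalize(ref)
```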
If you ran the agent N times and wrote `my_run_1.jsonl` … `my_run_N.jsonl`, use `pass_at_n_acc.py` to count a task as correct if any of the N submissions got it right.
Usage:
```bash
python pass_at_n_acc.py REFERENCE_JSONL RUN_PREFIX N [OUTPUT_TXT]
```

`RUN_PREFIX` is the common stem without the trailing `_1.jsonl` suffix. For the files `my_run_1.jsonl my_run_2.jsonl my_run_3.jsonl`, use `RUN_PREFIX=my_run` and `N=3`.
Example:
```bash
python pass_at_n_acc.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/my_run 5 results/pass_at_5.txt
```

The report prints to stdout and, if OUTPUT_TXT is provided, also writes to that file.
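Conceptually, pass@N just ORs correctness across the N runs for each task. A rough sketch of the counting logic, reusing the `is_correct` helper sketched above (the `"Final answer"` field name is a guess; check the reference JSONL for the actual key):

```python
import json

def load_answers(path: str, field: str) -> dict[str, str]:
    """Map task_id -> answer string from a JSONL file."""
    with open(path) as f:
        return {row["task_id"]: str(row[field]) for row in map(json.loads, f)}

def pass_at_n(reference_path: str, run_prefix: str, n: int) -> float:
    refs = load_answers(reference_path, "Final answer")   # assumed field name
    runs = [load_answers(f"{run_prefix}_{i}.jsonl", "model_answer")
            for i in range(1, n + 1)]
    # A task counts as solved if any run's answer matches the reference.
    solved = sum(
        any(is_correct(run.get(tid, ""), ref) for run in runs)
        for tid, ref in refs.items()
    )
    return solved / len(refs)
```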
`Agent217_test_set_submission.jsonl` is an example JSONL submission for a private test set produced via pass@N strategies. It demonstrates the expected `{"task_id", "model_answer", ...}` shape for leaderboard submission. This file cannot be evaluated here without access to the private ground truth.
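Each line of a submission is one JSON object. An illustrative line (placeholder values; additional fields may follow the two required ones):

```json
{"task_id": "00000000-0000-0000-0000-000000000000", "model_answer": "42"}
```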
- `GAIA_agent_design/`
  - Implements the multi‑agent GAIA workflow using the Agents SDK.
  - Includes `research_bot/` (agents and manager), `agents_lib/` (processors/tools), and `classified_tests/` (validation metadata and small batches).
  - See `GAIA_agent_design/AGENT.MD` and `GAIA_agent_design/README.md` for full details and run commands.
- `GAIA_self_hosted_agent/`
  - Mirrors the GAIA workflow but routes model calls to your own vLLM OpenAI‑compatible server via `vllm_client.py`.
  - Configure `VLLM_BASE_URL`, `VLLM_API_KEY`, and `VLLM_MODEL` in your environment.
  - See `GAIA_self_hosted_agent/AGENT.MD` and `GAIA_self_hosted_agent/README.md`.
- `insights_extraction/`
  - Trace analytics pipeline (Logfire → full spans → per‑trace insights → figures).
  - Structured as `scripts/`, `data/{raw,processed,insights,inputs}/`, `figs/validation/`, and `results/`.
  - See `insights_extraction/README.md` for end‑to‑end instructions.
- Agent Landscape (PDF): agent landscape.pdf
- Agent Workflow (PDF): agent workflow.pdf
- Experiment Notes (PDF): Experiment Notes.pdf
- Images provided under `docker/`:
  - `docker/Dockerfile.agent_design`: GAIA design agent runner.
  - `docker/Dockerfile.agent_self_hosted`: vLLM self‑hosted agent runner.
  - `docker/Dockerfile.eval`: Submission evaluator.
  - `docker/Dockerfile.insights`: Insights extraction pipeline.
  - `docker/docker-compose.yml`: Example services and wiring.
- Build images (from repo root):

  ```bash
  docker build -f docker/Dockerfile.agent_design -t gaia-agent:design .
  docker build -f docker/Dockerfile.agent_self_hosted -t gaia-agent:self-hosted .
  docker build -f docker/Dockerfile.eval -t gaia-agent:eval .
  docker build -f docker/Dockerfile.insights -t gaia-agent:insights .
  ```
- Run with docker:
  - Design agent:

    ```bash
    docker run --rm -e OPENAI_API_KEY=$OPENAI_API_KEY -v "$PWD/out:/app/out" gaia-agent:design GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl
    ```

  - Self‑hosted agent (requires a running vLLM server):

    ```bash
    docker run --rm -e VLLM_BASE_URL=http://<host>:8000/v1 -e VLLM_API_KEY=x -e VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct" -v "$PWD/out:/app/out" gaia-agent:self-hosted GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl
    ```

  - Evaluate submission:

    ```bash
    docker run --rm -v "$PWD/out:/app/out" gaia-agent:eval out/submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl
    ```

  - Insights (Logfire):

    ```bash
    docker run --rm -e LOGFIRE_READ_TOKEN=$LOGFIRE_READ_TOKEN -v "$PWD/insights_extraction/data:/app/insights_extraction/data" -v "$PWD/insights_extraction/figs:/app/insights_extraction/figs" gaia-agent:insights --parquet insights_extraction/data/processed/validation_traces_full.parquet --metadata GAIA_agent_design/classified_tests/metadata.sorted.jsonl --out insights_extraction/data/insights/validation_traces_insights
    ```
- Run with docker compose (from `docker/`):
  - Copy the env template: `cp ../.env.example ../.env` and fill in values.
  - Self‑hosted agent: `docker compose up --build agent_self_hosted`
  - Design agent: `docker compose up --build agent_design`
  - Evaluate: `docker compose up --build eval`
  - Insights: `docker compose up --build insights`
- Env and volumes:
  - See `.env.example` for `OPENAI_API_KEY`, `VLLM_BASE_URL`, `VLLM_API_KEY`, `VLLM_MODEL`, and `LOGFIRE_READ_TOKEN` (a minimal example follows below).
  - Outputs mount to `./out` and `./insights_extraction/{data,figs}` as shown in the commands above.
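For reference, a minimal `.env` might look like this (illustrative values; treat `.env.example` as the authoritative list):

```bash
OPENAI_API_KEY=sk-...
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_API_KEY=x
VLLM_MODEL=meta-llama/Meta-Llama-3-8B-Instruct
LOGFIRE_READ_TOKEN=...
```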
- Clone and create a virtualenv:

  ```bash
  git clone <this-repo-url> && cd <repo>
  python -m venv .venv && source .venv/bin/activate   # Windows: .\.venv\Scripts\activate
  pip install -r requirements.txt
  ```
- Run the GAIA design agent (OpenAI‑compatible endpoint):
  - Set `OPENAI_API_KEY` if calling OpenAI or another compatible provider.
  - Single run: `python -m GAIA_agent_design.research_bot.main GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl`
  - Multi‑run: `python GAIA_agent_design/run_gaia_manager.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl 5 20`
- Run the self‑hosted agent (vLLM):

  ```bash
  export VLLM_BASE_URL=http://<host>:8000/v1
  export VLLM_API_KEY=x                                     # if required
  export VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"   # or your model
  ```

  - Single run: `python -m GAIA_self_hosted_agent.research_bot.main GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl`
  - Multi‑run: `python GAIA_self_hosted_agent/run_gaia_manager.py GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl 3 20`
- Evaluate submissions (validation set): `python evaluate_gaia.py out/submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl`
  - Pass@N: `python pass_at_n_acc.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/my_run 5 results/pass_at_5.txt`
- Insights pipeline (Logfire):
  - `export LOGFIRE_READ_TOKEN=...`
  - Fetch recent traces: `python insights_extraction/scripts/logfire_client_example.py`
  - Fetch full spans: `python insights_extraction/scripts/fetch_traces_from_id.py`
  - Compute insights (the idea is sketched below): `python insights_extraction/scripts/parse_insights.py --parquet insights_extraction/data/processed/validation_traces_full.parquet --metadata GAIA_agent_design/classified_tests/metadata.sorted.jsonl --out insights_extraction/data/insights/validation_traces_insights`
  - Plot figures: `python insights_extraction/scripts/plot_insights.py`
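At its core the insights step is a group‑by over span rows. A minimal sketch of the idea, assuming hypothetical column names (`trace_id`, `duration_ms`, `total_tokens`); the real schema lives in `parse_insights.py`:

```python
import pandas as pd

# Full-span export produced by the fetch step above.
spans = pd.read_parquet(
    "insights_extraction/data/processed/validation_traces_full.parquet"
)

# Aggregate per trace: span count, total duration, and token usage.
# Column names here are illustrative; see parse_insights.py for the real ones.
insights = spans.groupby("trace_id").agg(
    n_spans=("trace_id", "size"),
    total_duration_ms=("duration_ms", "sum"),
    total_tokens=("total_tokens", "sum"),
)
insights.to_csv(
    "insights_extraction/data/insights/validation_traces_insights.csv"
)
```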
- Design PDFs: See "Design Docs" above for inline previews, or open directly — Agent Landscape, Agent Workflow, Experiment Notes. These provide high‑level context on the system and summarize experiments and observations collected during development.
- The slide deck `group_lunch_8_14.pptx` gives an overview of the approach and findings; it's a good visual companion to the experiment notes.
- Validation data lives under `GAIA_agent_design/classified_tests/`. The self‑hosted folder contains a similar copy for convenience.