Agentic Benchmarking for GAIA

This repository contains the code, data, and helper scripts used to run, evaluate, and analyze GAIA benchmark submissions. It includes two variants of the GAIA agent (one built on the OpenAI Agents SDK and one self-hosted targeting a vLLM endpoint), an insights extraction pipeline for tracing and metrics, and scripts to evaluate JSONL submissions (single-run accuracy and pass@N).

Top‑Level Contents

  • evaluate_gaia.py: Evaluate a single GAIA‑style submission JSONL against a reference JSONL (validation set) and report accuracy with detailed mismatches.
  • pass_at_n_acc.py: Compute pass‑at‑N accuracy across multiple submissions with a common prefix (e.g., my_run_1.jsonl, my_run_2.jsonl, …).
  • Agent217_test_set_submission.jsonl: Example submission for a private test set generated with pass@N strategies. This cannot be scored without private references, but it shows the expected JSONL format.
  • GAIA_agent_design/: Main GAIA agent implementation using the OpenAI Agents SDK (planning → searching → writing → evaluating → judging). See that folder’s AGENT.MD and README.md for details.
  • GAIA_self_hosted_agent/: Self‑hosted variant targeting a vLLM OpenAI‑compatible endpoint. See GAIA_self_hosted_agent/AGENT.MD and GAIA_self_hosted_agent/README.md for setup and usage.
  • insights_extraction/: Pipeline to query Logfire, fetch full traces, generate per‑trace insights (tokens, durations, roles, levels), and produce figures. See insights_extraction/README.md.
  • requirements.in / requirements.txt: Python dependencies.

Python Version and Setup

  • Python 3.10+ recommended.
  • From the repo root:
    pip install -r requirements.txt

Evaluate a Single Submission (Validation Set)

evaluate_gaia.py compares your submission JSONL to a GAIA validation/reference JSONL that contains the ground truth answers. The scorer handles GAIA’s comparison rules (numeric normalization, list answers, punctuation/whitespace insensitivity).

Usage:

python evaluate_gaia.py SUBMISSION_JSONL REFERENCE_JSONL

Example (using the bundled GAIA validation metadata):

python evaluate_gaia.py my_submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl

Output: overall accuracy plus a list of mismatches with task IDs, questions, predicted answers, and references.
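A simplified sketch of these comparison rules is shown below (illustrative Python, not the exact logic in evaluate_gaia.py, which follows GAIA's official scoring more closely):

import re
import string

def normalize_str(s: str) -> str:
    # lowercase, then drop punctuation and whitespace
    s = s.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", s)

def answers_match(pred: str, ref: str) -> bool:
    def to_num(x: str):
        # numeric normalization: strip $ , % and compare as floats
        try:
            return float(x.replace(",", "").replace("$", "").replace("%", "").strip())
        except ValueError:
            return None
    p, r = to_num(pred), to_num(ref)
    if r is not None:
        return p is not None and p == r
    if "," in ref or ";" in ref:
        # list answers: split on commas/semicolons and compare element-wise
        preds, refs = re.split(r"[;,]", pred), re.split(r"[;,]", ref)
        return len(preds) == len(refs) and all(
            answers_match(a.strip(), b.strip()) for a, b in zip(preds, refs)
        )
    # plain strings: punctuation- and whitespace-insensitive comparison
    return normalize_str(pred) == normalize_str(ref)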

Compute Pass‑at‑N Accuracy

If you ran the agent N times and wrote my_run_1.jsonl … my_run_N.jsonl, use pass_at_n_acc.py to count a task as correct if any of the N submissions got it right.

Usage:

python pass_at_n_acc.py REFERENCE_JSONL RUN_PREFIX N [OUTPUT_TXT]

  • RUN_PREFIX is the common stem without the per-run suffix (_1.jsonl, _2.jsonl, …). For the files
    my_run_1.jsonl
    my_run_2.jsonl
    my_run_3.jsonl
    use RUN_PREFIX=my_run and N=3.

Example:

python pass_at_n_acc.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/my_run 5 results/pass_at_5.txt

The report prints to stdout and, if OUTPUT_TXT is provided, also writes to that file.
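For reference, the counting rule amounts to the following (a minimal sketch, not pass_at_n_acc.py's actual implementation; it assumes the {"task_id", "model_answer"} JSONL shape and an answers_match-style comparison like the one sketched in the evaluation section above):

import json

def load_submission(path: str) -> dict:
    # map task_id -> model_answer for one submission JSONL file
    with open(path) as f:
        return {
            row["task_id"]: row["model_answer"]
            for row in (json.loads(line) for line in f if line.strip())
        }

def pass_at_n(reference: dict, run_prefix: str, n: int) -> float:
    # reference maps task_id -> ground-truth answer (loaded from the reference JSONL)
    runs = [load_submission(f"{run_prefix}_{i}.jsonl") for i in range(1, n + 1)]
    # a task counts as correct if any of the N runs matched the reference answer
    correct = sum(
        any(answers_match(run.get(task_id, ""), answer) for run in runs)
        for task_id, answer in reference.items()
    )
    return correct / len(reference)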

Private Test‑Set Submission

  • Agent217_test_set_submission.jsonl is an example JSONL submission for a private test set produced via pass@N strategies. It demonstrates the expected {"task_id", "model_answer", ...} shape for leaderboard submission. This file cannot be evaluated here without access to the private ground truth.
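For illustration, each line of a submission is a standalone JSON object; a hypothetical entry (placeholder task_id and answer, with additional fields permitted) might look like:

{"task_id": "00000000-0000-0000-0000-000000000000", "model_answer": "Paris"}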

Folders at a Glance

  • GAIA_agent_design/

    • Implements the multi‑agent GAIA workflow using the Agents SDK.
    • Includes research_bot/ (agents and manager), agents_lib/ (processors/tools), and classified_tests/ (validation metadata and small batches).
    • See GAIA_agent_design/AGENT.MD and GAIA_agent_design/README.md for full details and run commands.
  • GAIA_self_hosted_agent/

    • Mirrors the GAIA workflow but routes model calls to your own vLLM OpenAI‑compatible server via vllm_client.py.
    • Configure VLLM_BASE_URL, VLLM_API_KEY, and VLLM_MODEL in your environment.
    • See GAIA_self_hosted_agent/AGENT.MD and GAIA_self_hosted_agent/README.md.
  • insights_extraction/

    • Trace analytics pipeline (Logfire → full spans → per‑trace insights → figures).
    • Structured as scripts/, data/{raw,processed,insights,inputs}/, figs/validation/, results/.
    • See insights_extraction/README.md for end‑to‑end instructions.

Design Docs

  • Agent Landscape, Agent Workflow, and Experiment Notes (PDFs): see the Notes section below for a short description of each.

Docker Usage

  • Images provided under docker/:

    • docker/Dockerfile.agent_design: GAIA design agent runner.
    • docker/Dockerfile.agent_self_hosted: vLLM self‑hosted agent runner.
    • docker/Dockerfile.eval: Submission evaluator.
    • docker/Dockerfile.insights: Insights extraction pipeline.
    • docker/docker-compose.yml: Example services and wiring.
  • Build images (from repo root):

    • docker build -f docker/Dockerfile.agent_design -t gaia-agent:design .
    • docker build -f docker/Dockerfile.agent_self_hosted -t gaia-agent:self-hosted .
    • docker build -f docker/Dockerfile.eval -t gaia-agent:eval .
    • docker build -f docker/Dockerfile.insights -t gaia-agent:insights .
  • Run with docker:

    • Design agent:
      • docker run --rm -e OPENAI_API_KEY=$OPENAI_API_KEY -v "$PWD/out:/app/out" gaia-agent:design GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl
    • Self‑hosted agent (requires vLLM server):
      • docker run --rm -e VLLM_BASE_URL=http://<host>:8000/v1 -e VLLM_API_KEY=x -e VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct" -v "$PWD/out:/app/out" gaia-agent:self-hosted GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl
    • Evaluate submission:
      • docker run --rm -v "$PWD/out:/app/out" gaia-agent:eval out/submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl
    • Insights (Logfire):
      • docker run --rm -e LOGFIRE_READ_TOKEN=$LOGFIRE_READ_TOKEN -v "$PWD/insights_extraction/data:/app/insights_extraction/data" -v "$PWD/insights_extraction/figs:/app/insights_extraction/figs" gaia-agent:insights --parquet insights_extraction/data/processed/validation_traces_full.parquet --metadata GAIA_agent_design/classified_tests/metadata.sorted.jsonl --out insights_extraction/data/insights/validation_traces_insights
  • Run with docker compose (from docker/):

    • Copy the env template: cp ../.env.example ../.env and fill in the values.
    • Self‑hosted agent: docker compose up --build agent_self_hosted
    • Design agent: docker compose up --build agent_design
    • Evaluate: docker compose up --build eval
    • Insights: docker compose up --build insights
  • Env and volumes:

    • See .env.example for OPENAI_API_KEY, VLLM_BASE_URL, VLLM_API_KEY, VLLM_MODEL, and LOGFIRE_READ_TOKEN; an example .env is sketched after this list.
    • Outputs mount to ./out and ./insights_extraction/{data,figs} as shown in the commands above.
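A hypothetical .env might look like the following (placeholder values; .env.example lists the keys actually expected):

OPENAI_API_KEY=sk-...
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_API_KEY=x
VLLM_MODEL=meta-llama/Meta-Llama-3-8B-Instruct
LOGFIRE_READ_TOKEN=...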

Local Setup (Clone)

  • Clone and create a virtualenv:

    • git clone <this-repo-url> && cd <repo>
    • python -m venv .venv && source .venv/bin/activate (Windows: .\.venv\Scripts\activate)
    • pip install -r requirements.txt
  • Run the GAIA design agent (OpenAI-compatible endpoint):

    • Set OPENAI_API_KEY if calling OpenAI or another compatible provider.
    • Single run: python -m GAIA_agent_design.research_bot.main GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl
    • Multi-run: python GAIA_agent_design/run_gaia_manager.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl 5 20
  • Run the self-hosted agent (vLLM; a client sketch showing how these variables are typically consumed appears after this list):

    • export VLLM_BASE_URL=http://<host>:8000/v1
    • export VLLM_API_KEY=x (if required)
    • export VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct" (or your model)
    • Single run: python -m GAIA_self_hosted_agent.research_bot.main GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl
    • Multi-run: python GAIA_self_hosted_agent/run_gaia_manager.py GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl 3 20
  • Evaluate submissions (validation set):

    • python evaluate_gaia.py out/submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl
    • Pass@N: python pass_at_n_acc.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/my_run 5 results/pass_at_5.txt
  • Insights pipeline (Logfire):

    • export LOGFIRE_READ_TOKEN=...
    • Fetch recent traces: python insights_extraction/scripts/logfire_client_example.py
    • Fetch full spans: python insights_extraction/scripts/fetch_traces_from_id.py
    • Compute insights: python insights_extraction/scripts/parse_insights.py --parquet insights_extraction/data/processed/validation_traces_full.parquet --metadata GAIA_agent_design/classified_tests/metadata.sorted.jsonl --out insights_extraction/data/insights/validation_traces_insights
    • Plot figures: python insights_extraction/scripts/plot_insights.py
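For orientation, the VLLM_* variables used by the self-hosted agent are the standard ingredients of an OpenAI-compatible client. A minimal sketch of such a client is shown below (an illustration using the openai Python package, not the repository's actual vllm_client.py, which may differ):

import os
from openai import OpenAI

# Point an OpenAI-compatible client at the vLLM server.
client = OpenAI(
    base_url=os.environ["VLLM_BASE_URL"],         # e.g. http://<host>:8000/v1
    api_key=os.environ.get("VLLM_API_KEY", "x"),  # placeholder unless the server enforces an API key
)

response = client.chat.completions.create(
    model=os.environ["VLLM_MODEL"],               # e.g. meta-llama/Meta-Llama-3-8B-Instruct
    messages=[{"role": "user", "content": "Hello from the self-hosted agent"}],
)
print(response.choices[0].message.content)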

Notes

  • Design PDFs: Agent Landscape, Agent Workflow, and Experiment Notes (listed under "Design Docs" above). They provide high-level context on the system and summarize experiments and observations collected during development.
  • The slide deck group_lunch_8_14.pptx contains presentation slides used to share an overview of the approach and findings; it’s a good visual companion to the experiment notes.
  • Validation data lives under GAIA_agent_design/classified_tests/. The self‑hosted folder contains a similar copy for convenience.
