Agentic Benchmarking for GAIA

This repository contains the code, data, and helper scripts used to run, evaluate, and analyze GAIA benchmark submissions. It includes two variants of the GAIA agent (one built on the OpenAI Agents SDK and one self-hosted targeting a vLLM endpoint), an insights extraction pipeline for tracing and metrics, and scripts to evaluate JSONL submissions (single-run accuracy and pass@N).

Top‑Level Contents

  • evaluate_gaia.py: Evaluate a single GAIA‑style submission JSONL against a reference JSONL (validation set) and report accuracy with detailed mismatches.
  • pass_at_n_acc.py: Compute pass‑at‑N accuracy across multiple submissions with a common prefix (e.g., my_run_1.jsonl, my_run_2.jsonl, …).
  • Agent217_test_set_submission.jsonl: Example submission for a private test set generated with pass@N strategies. This cannot be scored without private references, but it shows the expected JSONL format.
  • GAIA_agent_design/: Main GAIA agent implementation using the OpenAI Agents SDK (planning → searching → writing → evaluating → judging). See that folder’s AGENT.MD and README.md for details.
  • GAIA_self_hosted_agent/: Self‑hosted variant targeting a vLLM OpenAI‑compatible endpoint. See GAIA_self_hosted_agent/AGENT.MD and GAIA_self_hosted_agent/README.md for setup and usage.
  • insights_extraction/: Pipeline to query Logfire, fetch full traces, generate per‑trace insights (tokens, durations, roles, levels), and produce figures. See insights_extraction/README.md.
  • requirements.in / requirements.txt: Python dependencies.

Python Version and Setup

  • Python 3.10+ recommended.
  • From the repo root:
    pip install -r requirements.txt

Evaluate a Single Submission (Validation Set)

evaluate_gaia.py compares your submission JSONL to a GAIA validation/reference JSONL that contains the ground truth answers. The scorer handles GAIA’s comparison rules (numeric normalization, list answers, punctuation/whitespace insensitivity).

Usage:

python evaluate_gaia.py SUBMISSION_JSONL REFERENCE_JSONL

Example (using the bundled GAIA validation metadata):

python evaluate_gaia.py my_submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl

Output: overall accuracy plus a list of mismatches with task IDs, questions, predicted answers, and references.
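A simplified sketch of these comparison rules is shown below (illustrative Python, not the exact logic in evaluate_gaia.py, which follows GAIA's official scoring more closely):

import re
import string

def normalize_str(s: str) -> str:
    # lowercase, then drop punctuation and whitespace
    s = s.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "", s)

def answers_match(pred: str, ref: str) -> bool:
    def to_num(x: str):
        # numeric normalization: strip $ , % and compare as floats
        try:
            return float(x.replace(",", "").replace("$", "").replace("%", "").strip())
        except ValueError:
            return None
    p, r = to_num(pred), to_num(ref)
    if r is not None:
        return p is not None and p == r
    if "," in ref or ";" in ref:
        # list answers: split on commas/semicolons and compare element-wise
        preds, refs = re.split(r"[;,]", pred), re.split(r"[;,]", ref)
        return len(preds) == len(refs) and all(
            answers_match(a.strip(), b.strip()) for a, b in zip(preds, refs)
        )
    # plain strings: punctuation- and whitespace-insensitive comparison
    return normalize_str(pred) == normalize_str(ref)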

Compute Pass‑at‑N Accuracy

If you ran the agent N times and wrote my_run_1.jsonl … my_run_N.jsonl, use pass_at_n_acc.py to count a task as correct if any of the N submissions got it right.

Usage:

python pass_at_n_acc.py REFERENCE_JSONL RUN_PREFIX N [OUTPUT_TXT]

  • RUN_PREFIX is the common stem without the per-run suffix (_1.jsonl, _2.jsonl, …). For the files
    my_run_1.jsonl
    my_run_2.jsonl
    my_run_3.jsonl
    use RUN_PREFIX=my_run and N=3.

Example:

python pass_at_n_acc.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/my_run 5 results/pass_at_5.txt

The report prints to stdout and, if OUTPUT_TXT is provided, also writes to that file.
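For reference, the counting rule amounts to the following (a minimal sketch, not pass_at_n_acc.py's actual implementation; it assumes the {"task_id", "model_answer"} JSONL shape and an answers_match-style comparison like the one sketched in the evaluation section above):

import json

def load_submission(path: str) -> dict:
    # map task_id -> model_answer for one submission JSONL file
    with open(path) as f:
        return {
            row["task_id"]: row["model_answer"]
            for row in (json.loads(line) for line in f if line.strip())
        }

def pass_at_n(reference: dict, run_prefix: str, n: int) -> float:
    # reference maps task_id -> ground-truth answer (loaded from the reference JSONL)
    runs = [load_submission(f"{run_prefix}_{i}.jsonl") for i in range(1, n + 1)]
    # a task counts as correct if any of the N runs matched the reference answer
    correct = sum(
        any(answers_match(run.get(task_id, ""), answer) for run in runs)
        for task_id, answer in reference.items()
    )
    return correct / len(reference)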

Private Test‑Set Submission

  • Agent217_test_set_submission.jsonl is an example JSONL submission for a private test set produced via pass@N strategies. It demonstrates the expected {"task_id", "model_answer", ...} shape for leaderboard submission. This file cannot be evaluated here without access to the private ground truth.
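For illustration, each line of a submission is a standalone JSON object; a hypothetical entry (placeholder task_id and answer, with additional fields permitted) might look like:

{"task_id": "00000000-0000-0000-0000-000000000000", "model_answer": "Paris"}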

Folders at a Glance

  • GAIA_agent_design/

    • Implements the multi‑agent GAIA workflow using the Agents SDK.
    • Includes research_bot/ (agents and manager), agents_lib/ (processors/tools), and classified_tests/ (validation metadata and small batches).
    • See GAIA_agent_design/AGENT.MD and GAIA_agent_design/README.md for full details and run commands.
  • GAIA_self_hosted_agent/

    • Mirrors the GAIA workflow but routes model calls to your own vLLM OpenAI‑compatible server via vllm_client.py.
    • Configure VLLM_BASE_URL, VLLM_API_KEY, and VLLM_MODEL in your environment.
    • See GAIA_self_hosted_agent/AGENT.MD and GAIA_self_hosted_agent/README.md.
  • insights_extraction/

    • Trace analytics pipeline (Logfire → full spans → per‑trace insights → figures).
    • Structured as scripts/, data/{raw,processed,insights,inputs}/, figs/validation/, results/.
    • See insights_extraction/README.md for end‑to‑end instructions.

Design Docs

  • Agent Landscape, Agent Workflow, and Experiment Notes (PDFs): see the Notes section below for a short description of each.

Docker Usage

  • Images provided under docker/:

    • docker/Dockerfile.agent_design: GAIA design agent runner.
    • docker/Dockerfile.agent_self_hosted: vLLM self‑hosted agent runner.
    • docker/Dockerfile.eval: Submission evaluator.
    • docker/Dockerfile.insights: Insights extraction pipeline.
    • docker/docker-compose.yml: Example services and wiring.
  • Build images (from repo root):

    • docker build -f docker/Dockerfile.agent_design -t gaia-agent:design .
    • docker build -f docker/Dockerfile.agent_self_hosted -t gaia-agent:self-hosted .
    • docker build -f docker/Dockerfile.eval -t gaia-agent:eval .
    • docker build -f docker/Dockerfile.insights -t gaia-agent:insights .
  • Run with docker:

    • Design agent:
      • docker run --rm -e OPENAI_API_KEY=$OPENAI_API_KEY -v "$PWD/out:/app/out" gaia-agent:design GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl
    • Self‑hosted agent (requires vLLM server):
      • docker run --rm -e VLLM_BASE_URL=http://<host>:8000/v1 -e VLLM_API_KEY=x -e VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct" -v "$PWD/out:/app/out" gaia-agent:self-hosted GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl
    • Evaluate submission:
      • docker run --rm -v "$PWD/out:/app/out" gaia-agent:eval out/submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl
    • Insights (Logfire):
      • docker run --rm -e LOGFIRE_READ_TOKEN=$LOGFIRE_READ_TOKEN -v "$PWD/insights_extraction/data:/app/insights_extraction/data" -v "$PWD/insights_extraction/figs:/app/insights_extraction/figs" gaia-agent:insights --parquet insights_extraction/data/processed/validation_traces_full.parquet --metadata GAIA_agent_design/classified_tests/metadata.sorted.jsonl --out insights_extraction/data/insights/validation_traces_insights
  • Run with docker compose (from docker/):

    • Copy the env template: cp ../.env.example ../.env and fill in the values.
    • Self‑hosted agent: docker compose up --build agent_self_hosted
    • Design agent: docker compose up --build agent_design
    • Evaluate: docker compose up --build eval
    • Insights: docker compose up --build insights
  • Env and volumes:

    • See .env.example for OPENAI_API_KEY, VLLM_BASE_URL, VLLM_API_KEY, VLLM_MODEL, and LOGFIRE_READ_TOKEN; an example .env is sketched after this list.
    • Outputs mount to ./out and ./insights_extraction/{data,figs} as shown in the commands above.
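A hypothetical .env might look like the following (placeholder values; .env.example lists the keys actually expected):

OPENAI_API_KEY=sk-...
VLLM_BASE_URL=http://localhost:8000/v1
VLLM_API_KEY=x
VLLM_MODEL=meta-llama/Meta-Llama-3-8B-Instruct
LOGFIRE_READ_TOKEN=...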

Local Setup (Clone)

  • Clone and create a virtualenv:

    • git clone <this-repo-url> && cd <repo>
    • python -m venv .venv && source .venv/bin/activate (Windows: .\.venv\Scripts\activate)
    • pip install -r requirements.txt
  • Run the GAIA design agent (OpenAI-compatible endpoint):

    • Set OPENAI_API_KEY if calling OpenAI or another compatible provider.
    • Single run: python -m GAIA_agent_design.research_bot.main GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl
    • Multi-run: python GAIA_agent_design/run_gaia_manager.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/submission.jsonl 5 20
  • Run the self-hosted agent (vLLM; a client sketch showing how these variables are typically consumed appears after this list):

    • export VLLM_BASE_URL=http://<host>:8000/v1
    • export VLLM_API_KEY=x (if required)
    • export VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct" (or your model)
    • Single run: python -m GAIA_self_hosted_agent.research_bot.main GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl
    • Multi-run: python GAIA_self_hosted_agent/run_gaia_manager.py GAIA_self_hosted_agent/classified_tests/small_batch.jsonl out/self_hosted_submission.jsonl 3 20
  • Evaluate submissions (validation set):

    • python evaluate_gaia.py out/submission.jsonl GAIA_agent_design/classified_tests/metadata.sorted.jsonl
    • Pass@N: python pass_at_n_acc.py GAIA_agent_design/classified_tests/metadata.sorted.jsonl out/my_run 5 results/pass_at_5.txt
  • Insights pipeline (Logfire):

    • export LOGFIRE_READ_TOKEN=...
    • Fetch recent traces: python insights_extraction/scripts/logfire_client_example.py
    • Fetch full spans: python insights_extraction/scripts/fetch_traces_from_id.py
    • Compute insights: python insights_extraction/scripts/parse_insights.py --parquet insights_extraction/data/processed/validation_traces_full.parquet --metadata GAIA_agent_design/classified_tests/metadata.sorted.jsonl --out insights_extraction/data/insights/validation_traces_insights
    • Plot figures: python insights_extraction/scripts/plot_insights.py
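For orientation, the VLLM_* variables used by the self-hosted agent are the standard ingredients of an OpenAI-compatible client. A minimal sketch of such a client is shown below (an illustration using the openai Python package, not the repository's actual vllm_client.py, which may differ):

import os
from openai import OpenAI

# Point an OpenAI-compatible client at the vLLM server.
client = OpenAI(
    base_url=os.environ["VLLM_BASE_URL"],         # e.g. http://<host>:8000/v1
    api_key=os.environ.get("VLLM_API_KEY", "x"),  # placeholder unless the server enforces an API key
)

response = client.chat.completions.create(
    model=os.environ["VLLM_MODEL"],               # e.g. meta-llama/Meta-Llama-3-8B-Instruct
    messages=[{"role": "user", "content": "Hello from the self-hosted agent"}],
)
print(response.choices[0].message.content)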

Notes

  • Design PDFs: Agent Landscape, Agent Workflow, and Experiment Notes (listed under "Design Docs" above). They provide high-level context on the system and summarize experiments and observations collected during development.
  • The slide deck group_lunch_8_14.pptx contains presentation slides used to share an overview of the approach and findings; it’s a good visual companion to the experiment notes.
  • Validation data lives under GAIA_agent_design/classified_tests/. The self‑hosted folder contains a similar copy for convenience.
