Design and run controlled comparisons to evaluate prompting, protocol, and architecture hypotheses #28

Open
@abrichr

Description

To guide OmniMCP development effectively, we need a lightweight framework for performing empirical comparisons across prompting strategies, input/output formats, memory configurations, and execution flows. The goal is to move quickly while validating assumptions with measurable results.

This issue defines the first step toward making our iteration loop scientific: identify hypotheses, design minimal experiments, run controlled comparisons, and record outcomes.

This infrastructure will allow us to benchmark design decisions rather than rely on intuition or architectural preference.


Goals

  • Evaluate prompting strategies (e.g., instruction-only, chain-of-thought (CoT), Set-of-Marks)
  • Compare input formats (e.g., JSON vs DSL, screenshot + bounding boxes vs full DOM trees)
  • Test output formats (structured tool calls, action DSLs, text-based plans)
  • Assess memory models (no memory vs long context vs RAG vs tool-based memory)
  • Track basic performance metrics (e.g., success/failure, steps-to-goal, LLM call count, latency)
  • Generate comparative logs and/or summaries to inform design choices

Approach

  1. Define key hypotheses
    Examples:

    • "Set-of-Marks prompting improves UI targeting accuracy"
    • "Explicit bounding box context reduces hallucinated actions"
    • "Persistent memory improves multi-step task success rate"
  2. Design minimal controlled experiments

    • Fix seeds and goals; vary only one factor at a time
    • Log outcomes in a structured format (JSON + screenshots + plan/output); see the logging sketch after this list
  3. Log and visualize results

    • Store experiments in experiments/
    • Include timestamp, config, and outcomes for easy comparison
    • Add CLI option or helper script to run predefined comparisons
  4. Track outcomes and summarize findings

    • Use markdown summaries or dashboards to record results
    • Record success/failure outcomes together with the inputs needed to reproduce each run
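
A minimal sketch of how steps 2 and 3 could be implemented; the helper name log_experiment, the timestamp field, and the file layout under experiments/ are assumptions for illustration, not a decided interface:

# experiment_log.py -- minimal sketch; helper name and file layout are illustrative only.
import json
import time
from pathlib import Path

def log_experiment(experiment_id: str, task: str, variants: dict,
                   root: Path = Path("experiments")) -> Path:
    """Write one experiment record to experiments/<experiment_id>.json.

    `variants` maps a label such as "variant_a" to an outcome dict like
    {"strategy": "CoT", "success": True, "steps": 3, "llm_calls": 2},
    matching the per-experiment format shown below.
    """
    root.mkdir(parents=True, exist_ok=True)
    record = {"experiment_id": experiment_id, "task": task,
              "timestamp": time.time(), **variants}
    out = root / f"{experiment_id}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

With one JSON file per experiment, variants that differ in a single factor can be compared by diffing files or loading them into a summary script.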

Example Format (per experiment)

{
  "experiment_id": "exp_2024_04_07_prompting_variants",
  "task": "Login to demo app",
  "variant_a": {
    "strategy": "CoT",
    "success": true,
    "steps": 3,
    "llm_calls": 2
  },
  "variant_b": {
    "strategy": "Set-of-Marks",
    "success": false,
    "steps": 4,
    "llm_calls": 3
  }
}
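
To support step 4, a small summarizer could turn those per-experiment JSON files into a markdown table for write-ups; this is a sketch that simply mirrors the field names in the example format above:

# summarize.py -- illustrative sketch; reads the record format shown above.
import json
from pathlib import Path

def summarize(root: str = "experiments") -> str:
    """Render a markdown table with one row per (experiment, variant) pair."""
    rows = [
        "| experiment | variant | strategy | success | steps | llm_calls |",
        "|---|---|---|---|---|---|",
    ]
    for path in sorted(Path(root).glob("*.json")):
        record = json.loads(path.read_text())
        for name, result in record.items():
            if not name.startswith("variant_"):
                continue
            rows.append(
                f"| {record['experiment_id']} | {name} | {result['strategy']} "
                f"| {result['success']} | {result['steps']} | {result['llm_calls']} |"
            )
    return "\n".join(rows)

if __name__ == "__main__":
    print(summarize())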

Tasks

  • Create experiments/ folder with structured logging format
  • Define at least 3 initial hypotheses to test
  • Add CLI utility for running experiments with fixed configs (see the sketch after this list)
  • Document experiment design conventions (how to vary conditions)
  • Add Markdown templates for writing up results
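
One possible shape for the CLI utility task (argparse-based; the config schema and the run_agent hook are hypothetical placeholders for OmniMCP's actual entry point):

# run_experiment.py -- hypothetical CLI shape for running predefined comparisons.
import argparse
import json
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(description="Run a predefined OmniMCP comparison.")
    parser.add_argument("config", type=Path,
                        help="JSON file fixing task, seed, and the variants to compare")
    parser.add_argument("--out", type=Path, default=Path("experiments"),
                        help="directory for structured result logs")
    args = parser.parse_args()

    config = json.loads(args.config.read_text())
    for name, variant in config["variants"].items():
        print(f"Running {config['experiment_id']} / {name}: {variant['strategy']}")
        # result = run_agent(task=config["task"], **variant)  # hypothetical hook
        # log_experiment(config["experiment_id"], config["task"], {name: result}, root=args.out)

if __name__ == "__main__":
    main()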

Notes

This framework should stay simple, fast, and low-overhead. The purpose is to support rapid iteration while grounding decisions in measurable outcomes. Over time, this could evolve into a benchmarking suite or contribute to a paper, but that is not the immediate goal.

This work complements protocol design and planner development by ensuring design choices are evaluated rather than assumed.
