Description
To guide OmniMCP development effectively, we need a lightweight framework for performing empirical comparisons across prompting strategies, input/output formats, memory configurations, and execution flows. The goal is to move quickly while validating assumptions with measurable results.
This issue defines the first step toward making our iteration loop scientific: identify hypotheses, design minimal experiments, run controlled comparisons, and record outcomes.
This infrastructure will allow us to benchmark design decisions rather than rely on intuition or architectural preference.
Goals
- Evaluate prompting strategies (e.g., instruction-only, CoT, Set-of-Marks)
- Compare input formats (e.g., JSON vs DSL, screenshot + bounding boxes vs full DOM trees)
- Test output formats (structured tool calls, action DSLs, text-based plans)
- Assess memory models (no memory vs long context vs RAG vs tool-based memory)
- Track basic performance metrics (e.g., success/failure, steps-to-goal, LLM call count, latency)
- Generate comparative logs and/or summaries to inform design choices
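The sketch below is one way these per-run metrics could be captured in code; the class and field names are illustrative, not a settled schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RunMetrics:
    """Hypothetical per-run record covering the metrics listed above."""
    experiment_id: str           # e.g. "exp_2024_04_07_prompting_variants"
    strategy: str                # prompting strategy / variant under test
    success: bool                # did the run reach the goal?
    steps: int                   # steps-to-goal
    llm_calls: int               # number of LLM calls made
    latency_s: float             # wall-clock seconds for the run
    notes: Optional[str] = None  # free-form observations

# asdict(RunMetrics(...)) yields a plain dict, ready to dump as JSON.
```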
Approach
- Define key hypotheses. Examples:
  - "Set-of-Marks prompting improves UI targeting accuracy"
  - "Explicit bounding box context reduces hallucinated actions"
  - "Persistent memory improves multi-step task success rate"
- Design minimal controlled experiments
  - Fix seed/goals, vary only one factor at a time
  - Log outcomes in structured format (JSON + screenshots + plan/output)
- Log and visualize results (see the logging sketch after this list)
  - Store experiments in `experiments/`
  - Include timestamp, config, and outcomes for easy comparison
  - Add CLI option or helper script to run predefined comparisons
- Track outcomes and summarize findings
  - Use markdown summaries or dashboards to record results
  - Include pass/fail or success/failure with reproducible inputs
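For the structured logging step, a minimal helper along these lines would be enough to start; the `experiments/` path and the `log_experiment` name are assumptions for illustration, not an existing OmniMCP API.

```python
import json
import time
from pathlib import Path

EXPERIMENTS_DIR = Path("experiments")  # assumed location, per the layout above

def log_experiment(experiment_id: str, config: dict, outcome: dict) -> Path:
    """Write one experiment record (timestamp + config + outcome) as JSON."""
    EXPERIMENTS_DIR.mkdir(exist_ok=True)
    record = {
        "experiment_id": experiment_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "config": config,    # the single factor being varied, plus fixed settings
        "outcome": outcome,  # success/failure, steps, llm_calls, latency, ...
    }
    path = EXPERIMENTS_DIR / f"{experiment_id}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```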
Example Format (per experiment)
```json
{
  "experiment_id": "exp_2024_04_07_prompting_variants",
  "task": "Login to demo app",
  "variant_a": {
    "strategy": "CoT",
    "success": true,
    "steps": 3,
    "llm_calls": 2
  },
  "variant_b": {
    "strategy": "Set-of-Marks",
    "success": false,
    "steps": 4,
    "llm_calls": 3
  }
}
```
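To compare variants across logged experiments, a small summarizer could walk `experiments/` and print one line per variant. This is a sketch assuming records shaped like the example above.

```python
import json
from pathlib import Path

def summarize(experiments_dir: str = "experiments") -> None:
    """Print a one-line comparison per variant across all logged experiments."""
    for path in sorted(Path(experiments_dir).glob("*.json")):
        record = json.loads(path.read_text())
        for name, variant in record.items():
            # Variant entries are dicts carrying a "strategy" key.
            if isinstance(variant, dict) and "strategy" in variant:
                print(
                    f"{record['experiment_id']:<40} {name}: "
                    f"{variant['strategy']:<14} "
                    f"success={variant['success']} "
                    f"steps={variant['steps']} "
                    f"llm_calls={variant['llm_calls']}"
                )
```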
Tasks
- Create `experiments/` folder with structured logging format
- Define at least 3 initial hypotheses to test
- Add CLI utility for running experiments with fixed configs (see the sketch below)
- Document experiment design conventions (how to vary conditions)
- Add Markdown templates for writing up results
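A possible shape for the CLI utility task: an argparse wrapper that loads a fixed config, runs the task, and writes a result record. `run_task` is a placeholder for whatever entry point the planner eventually exposes, not an existing API.

```python
import argparse
import json
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(description="Run a predefined OmniMCP experiment")
    parser.add_argument("config", help="Path to a JSON experiment config (fixed goal, one varied factor)")
    parser.add_argument("--out", default="experiments", help="Directory for result records")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    # Placeholder: invoke the planner/agent with the fixed config.
    # outcome = run_task(config)  # hypothetical entry point, not an existing API
    outcome = {"success": None, "steps": None, "llm_calls": None, "latency_s": None}

    out_dir = Path(args.out)
    out_dir.mkdir(exist_ok=True)
    record_path = out_dir / f"{config.get('experiment_id', 'experiment')}.json"
    record_path.write_text(json.dumps({"config": config, "outcome": outcome}, indent=2))
    print(f"Wrote {record_path}")

if __name__ == "__main__":
    main()
```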
Notes
This framework should stay simple, fast, and low-overhead. The purpose is to support rapid iteration while grounding decisions in measurable outcomes. Over time, this could evolve into a benchmarking suite or contribute to a paper, but that is not the immediate goal.
This work complements protocol design and planner development by ensuring design choices are evaluated rather than assumed.