Design and run controlled comparisons to evaluate prompting, protocol, and architecture hypotheses #28

Open
@abrichr

Description

To guide OmniMCP development effectively, we need a lightweight framework for performing empirical comparisons across prompting strategies, input/output formats, memory configurations, and execution flows. The goal is to move quickly while validating assumptions with measurable results.

This issue defines the first step toward making our iteration loop scientific: identify hypotheses, design minimal experiments, run controlled comparisons, and record outcomes.

This infrastructure will allow us to benchmark design decisions rather than rely on intuition or architectural preference.


Goals

  • Evaluate prompting strategies (e.g., instruction-only, chain-of-thought (CoT), Set-of-Marks)
  • Compare input formats (e.g., JSON vs DSL, screenshot + bounding boxes vs full DOM trees)
  • Test output formats (structured tool calls, action DSLs, text-based plans)
  • Assess memory models (no memory vs long context vs RAG vs tool-based memory)
  • Track basic performance metrics (e.g., success/failure, steps-to-goal, LLM call count, latency)
  • Generate comparative logs and/or summaries to inform design choices

Approach

  1. Define key hypotheses
    Examples:

    • "Set-of-Marks prompting improves UI targeting accuracy"
    • "Explicit bounding box context reduces hallucinated actions"
    • "Persistent memory improves multi-step task success rate"
  2. Design minimal controlled experiments

    • Fix seeds and goals; vary only one factor at a time
    • Log outcomes in a structured format (JSON + screenshots + plan/output); see the logging sketch after this list
  3. Log and visualize results

    • Store experiments in experiments/
    • Include timestamp, config, and outcomes for easy comparison
    • Add CLI option or helper script to run predefined comparisons
  4. Track outcomes and summarize findings

    • Use markdown summaries or dashboards to record results
    • Record success/failure outcomes together with the inputs needed to reproduce each run
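
A minimal sketch of how steps 2 and 3 could be implemented; the helper name log_experiment, the timestamp field, and the file layout under experiments/ are assumptions for illustration, not a decided interface:

# experiment_log.py -- minimal sketch; helper name and file layout are illustrative only.
import json
import time
from pathlib import Path

def log_experiment(experiment_id: str, task: str, variants: dict,
                   root: Path = Path("experiments")) -> Path:
    """Write one experiment record to experiments/<experiment_id>.json.

    `variants` maps a label such as "variant_a" to an outcome dict like
    {"strategy": "CoT", "success": True, "steps": 3, "llm_calls": 2},
    matching the per-experiment format shown below.
    """
    root.mkdir(parents=True, exist_ok=True)
    record = {"experiment_id": experiment_id, "task": task,
              "timestamp": time.time(), **variants}
    out = root / f"{experiment_id}.json"
    out.write_text(json.dumps(record, indent=2))
    return out

With one JSON file per experiment, variants that differ in a single factor can be compared by diffing files or loading them into a summary script.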

Example Format (per experiment)

{
  "experiment_id": "exp_2024_04_07_prompting_variants",
  "task": "Login to demo app",
  "variant_a": {
    "strategy": "CoT",
    "success": true,
    "steps": 3,
    "llm_calls": 2
  },
  "variant_b": {
    "strategy": "Set-of-Marks",
    "success": false,
    "steps": 4,
    "llm_calls": 3
  }
}
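
To support step 4, a small summarizer could turn those per-experiment JSON files into a markdown table for write-ups; this is a sketch that simply mirrors the field names in the example format above:

# summarize.py -- illustrative sketch; reads the record format shown above.
import json
from pathlib import Path

def summarize(root: str = "experiments") -> str:
    """Render a markdown table with one row per (experiment, variant) pair."""
    rows = [
        "| experiment | variant | strategy | success | steps | llm_calls |",
        "|---|---|---|---|---|---|",
    ]
    for path in sorted(Path(root).glob("*.json")):
        record = json.loads(path.read_text())
        for name, result in record.items():
            if not name.startswith("variant_"):
                continue
            rows.append(
                f"| {record['experiment_id']} | {name} | {result['strategy']} "
                f"| {result['success']} | {result['steps']} | {result['llm_calls']} |"
            )
    return "\n".join(rows)

if __name__ == "__main__":
    print(summarize())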

Tasks

  • Create experiments/ folder with structured logging format
  • Define at least 3 initial hypotheses to test
  • Add CLI utility for running experiments with fixed configs (see the sketch after this list)
  • Document experiment design conventions (how to vary conditions)
  • Add Markdown templates for writing up results
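
One possible shape for the CLI utility task (argparse-based; the config schema and the run_agent hook are hypothetical placeholders for OmniMCP's actual entry point):

# run_experiment.py -- hypothetical CLI shape for running predefined comparisons.
import argparse
import json
from pathlib import Path

def main() -> None:
    parser = argparse.ArgumentParser(description="Run a predefined OmniMCP comparison.")
    parser.add_argument("config", type=Path,
                        help="JSON file fixing task, seed, and the variants to compare")
    parser.add_argument("--out", type=Path, default=Path("experiments"),
                        help="directory for structured result logs")
    args = parser.parse_args()

    config = json.loads(args.config.read_text())
    for name, variant in config["variants"].items():
        print(f"Running {config['experiment_id']} / {name}: {variant['strategy']}")
        # result = run_agent(task=config["task"], **variant)  # hypothetical hook
        # log_experiment(config["experiment_id"], config["task"], {name: result}, root=args.out)

if __name__ == "__main__":
    main()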

Notes

This framework should stay simple, fast, and low-overhead. The purpose is to support rapid iteration while grounding decisions in measurable outcomes. Over time, this could evolve into a benchmarking suite or contribute to a paper, but that is not the immediate goal.

This work complements protocol design and planner development by ensuring design choices are evaluated rather than assumed.
