Description
This issue proposes a declarative, LLM-friendly schema for GUI actions. The goal is to define a structured output format that LLMs can reliably generate and that execution systems can interpret unambiguously. This schema will serve as the canonical interface between LLM-generated plans and OmniMCP's `InputController`, and will also be used in dataset generation, evaluation, and training workflows.
The format must support:
- Common GUI actions (click, type, scroll, hover, launch, wait, etc.)
- Multiple targeting strategies (track ID, element ID, text, bounding box)
- Extensibility for new actions and parameters (e.g. drag-and-drop, key combos)
- Easy validation and parsing
- Compatibility with prompt-based generation and structured output validation
Background
In state-of-the-art LLM-based systems that perform actions (e.g. clicking UI elements, controlling robots, or navigating apps), a core design choice is how to represent actions, observations, and reasoning. This representation serves as both the format for prompting and the structure for training data.
A good protocol must define:
- What the LLM observes (screenshot, goal, state history)
- What the LLM outputs (action, reasoning, completion signal)
- How to serialize and interpret this for execution, replay, logging, and training
This format is foundational to behavioral cloning, few-shot prompting, fine-tuning, and system-level evaluation.
Design Summary
We propose a JSON-first schema with an optional DSL for human input. This schema will replace `LLMActionPlan` and extend `types.py` to enable structured LLM-agent communication.
Core Principles
- JSON First: Canonical format for parsing, validation, logging, and tool-use.
- Unified `AgentAction`: Common action structure with type, target, and parameters.
- Rich Targeting: `track_id` > `element_id` > `text` > `bbox`, resolved by the executor (see the resolution sketch after this list).
- Flexible Parameters: Allow per-action arguments (e.g., text, scroll direction).
- Integrated Reasoning: Provide model reasoning alongside each decision.
- Vision-First Compatibility: Does not rely on structured DOMs; targets are resolved based on visual bounding boxes, OCR, and tracking.
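To make the targeting priority concrete, a minimal resolution sketch follows. It uses the `ActionTarget` model defined in the next section; the `VisualState` lookup helpers (`find_by_track_id`, `find_by_element_id`, `find_by_text`) are hypothetical stand-ins for OmniMCP's actual visual-state queries.
from typing import Optional

def resolve_target(target: ActionTarget, state: "VisualState") -> Optional[Bounds]:
    # Try each targeting field in priority order: track_id > element_id > text > bbox.
    if target.track_id is not None:
        element = state.find_by_track_id(target.track_id)  # hypothetical lookup by persistent ID
        if element is not None:
            return element.bbox
    if target.element_id is not None:
        element = state.find_by_element_id(target.element_id)  # hypothetical per-frame lookup
        if element is not None:
            return element.bbox
    if target.text is not None:
        element = state.find_by_text(target.text)  # hypothetical OCR / text match
        if element is not None:
            return element.bbox
    return target.bbox  # may be None if no targeting field resolves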
Schema (Pydantic-style)
from typing import Any, Dict, Literal, Optional, Tuple

from pydantic import BaseModel, Field

Bounds = Tuple[float, float, float, float]  # (x, y, width, height)

class ActionTarget(BaseModel):
    track_id: Optional[str] = Field(None, description="Persistent tracking ID (preferred)")
    element_id: Optional[int] = Field(None, description="Per-frame UIElement ID")
    text: Optional[str] = Field(None, description="Text content match")
    bbox: Optional[Bounds] = Field(None, description="Normalized [x, y, w, h] box")

class AgentAction(BaseModel):
    action_type: Literal["click", "type", "scroll", "press_key", "wait", "finish_goal", "hover"]
    target: Optional[ActionTarget] = None
    parameters: Dict[str, Any] = Field(default_factory=dict)

class LLMResponse(BaseModel):
    reasoning: str
    action: AgentAction
    is_goal_complete: bool = False
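As an illustration of parsing and validation, a raw LLM completion can be checked directly against this schema. The sketch below assumes Pydantic v2 (`model_validate_json`); with Pydantic v1 the equivalent would be `parse_raw`.
from pydantic import ValidationError

raw_output = '''
{
  "reasoning": "Clicking the tracked login button.",
  "action": {
    "action_type": "click",
    "target": {"track_id": "btn_login_0", "text": "Login"},
    "parameters": {}
  },
  "is_goal_complete": false
}
'''

try:
    response = LLMResponse.model_validate_json(raw_output)  # Pydantic v2 API
    print(response.action.action_type, response.is_goal_complete)
except ValidationError as err:
    # Out-of-schema output is caught here and can be fed back to the model for a retry.
    print(err)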
Examples
Click tracked element:
{
  "reasoning": "Clicking the tracked login button (track_id 'btn_login_0').",
  "action": {
    "action_type": "click",
    "target": {
      "track_id": "btn_login_0",
      "text": "Login"
    },
    "parameters": {}
  },
  "is_goal_complete": false
}
Type by text match:
{
  "reasoning": "Typing username 'testuser' into the 'Username' field.",
  "action": {
    "action_type": "type",
    "target": {
      "text": "Username"
    },
    "parameters": {
      "text_to_type": "testuser"
    }
  },
  "is_goal_complete": false
}
Finish task:
{
  "reasoning": "Login successful, goal complete.",
  "action": {
    "action_type": "finish_goal",
    "target": null,
    "parameters": {}
  },
  "is_goal_complete": true
}
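Because `parameters` is an open dictionary, new arguments can be carried without schema changes (e.g. key combinations or drag-and-drop). The example below is illustrative only; the `key` and `modifiers` parameter names are not fixed by this proposal.
{
  "reasoning": "Copying the selected text with a keyboard shortcut.",
  "action": {
    "action_type": "press_key",
    "target": null,
    "parameters": {
      "key": "c",
      "modifiers": ["cmd"]
    }
  },
  "is_goal_complete": false
}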
Implementation Plan
- Add Models: Integrate `ActionTarget`, `AgentAction`, and `LLMResponse` into `types.py`.
- Deprecate `LLMActionPlan`: Remove the old model or mark it as deprecated.
- Update Planner: Refactor `core.plan_action_for_ui` to emit `LLMResponse`.
- Adapt Executor: Parse `AgentAction`, resolve targets in priority order, and execute actions (see the dispatch sketch after this list).
- Update Logging: Store `LLMResponse` in `LoggedStep` with the full reasoning and action trace.
- Use Schema: Standardize traces, training data, and evaluation around this format.
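For the Adapt Executor step, a minimal dispatch sketch is shown below, building on the `resolve_target` helper sketched earlier. The `InputController` method names (`click`, `type_text`, `scroll`, `hover`, `press_key`) are assumptions for illustration, not the final API.
def execute(action: AgentAction, state: "VisualState", controller: "InputController") -> None:
    # Resolve the target (if any) to a normalized bounding box before acting.
    bbox = resolve_target(action.target, state) if action.target else None

    if action.action_type == "click":
        controller.click(bbox)  # assumed signature
    elif action.action_type == "type":
        controller.type_text(action.parameters.get("text_to_type", ""), bbox)  # assumed signature
    elif action.action_type == "scroll":
        controller.scroll(action.parameters.get("direction", "down"), bbox)  # assumed signature
    elif action.action_type == "hover":
        controller.hover(bbox)  # assumed signature
    elif action.action_type == "press_key":
        controller.press_key(action.parameters)  # assumed signature
    elif action.action_type in ("wait", "finish_goal"):
        pass  # handled by the agent loop (sleep / stop), not the input controller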
Goals
- Enable robust, interpretable communication between LLM planner and executor
- Provide a common representation for collected demonstrations and synthetic data
- Support structured prompting, validation, and traceability
- Serve as a foundation for future fine-tuning and dataset generation
- Interop with external protocols (Operator, LangChain tools, AWL) if needed
Future Work
This schema lays the groundwork for:
- Fine-Tuning for Action Models
  - Use collected traces in `LLMResponse` format as supervised training data
  - Enable instruction-conditioned, multimodal action prediction with vision-language models
- Process Dataset Standardization
  - Store long-horizon user workflows in a common format across applications
  - Use for summarization, labeling, or synthesis
- Multimodal Alignment
  - Align actions with screenshots or element segmentations for grounding
  - Apply to instruction-tuning, RAG, or distillation tasks
- Protocol Interoperability
  - Convert to/from formats used by LangGraph, AWL, Operator (a generic adapter is sketched after this list)
  - Support evaluation under common benchmarks or simulators
- Experimentation Framework
  - Use in `experiments/` to evaluate prompting strategies, output formats, and agent variants (see #28: Design and run controlled comparisons to evaluate prompting, protocol, and architecture hypotheses)
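As a concrete direction for the interoperability item above, an `AgentAction` could be flattened into a generic tool-call dictionary (name plus arguments), which most tool-use protocols consume in some form. This is an illustrative adapter only, not tied to any specific framework's actual API.
def to_tool_call(response: LLMResponse) -> dict:
    # Generic {name, arguments, metadata} shape; each framework would need its own adapter.
    target = response.action.target
    return {
        "name": response.action.action_type,
        "arguments": {
            "target": target.model_dump(exclude_none=True) if target else None,
            **response.action.parameters,
        },
        "metadata": {
            "reasoning": response.reasoning,
            "is_goal_complete": response.is_goal_complete,
        },
    }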
Notes
- While JSON is canonical, a minimal DSL may be introduced later for authoring or direct prompting
- The priority-based targeting logic avoids reliance on any one modality (e.g., DOM, OCR)
- This schema will also enable structured error handling, retry, and verification in future planner loops
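As a sketch of the structured error handling and retry mentioned in the last note, validation failures can be fed back into the prompt for another attempt. `call_llm(prompt) -> str` is a hypothetical stand-in for the planner's model call.
from pydantic import ValidationError

def plan_with_retry(prompt: str, max_attempts: int = 3) -> LLMResponse:
    # Validate model output against the schema; on failure, retry with the error appended.
    last_error = None
    for _ in range(max_attempts):
        full_prompt = prompt if last_error is None else f"{prompt}\n\nYour last output was invalid:\n{last_error}"
        raw = call_llm(full_prompt)  # hypothetical model call
        try:
            return LLMResponse.model_validate_json(raw)
        except ValidationError as err:
            last_error = str(err)
    raise RuntimeError(f"No valid LLMResponse after {max_attempts} attempts")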