Description
This issue proposes a declarative, LLM-friendly schema for GUI actions. The goal is to define a structured output format that LLMs can reliably generate and that execution systems can interpret unambiguously. This schema will serve as the canonical interface between LLM-generated plans and OmniMCP's `InputController`, and will also be used in dataset generation, evaluation, and training workflows.
The format must support:
- Common GUI actions (click, type, scroll, hover, launch, wait, etc.)
- Multiple targeting strategies (track ID, element ID, text, bounding box)
- Extensibility for new actions and parameters (e.g. drag-and-drop, key combos)
- Easy validation and parsing
- Compatibility with prompt-based generation and structured output validation
Background
In state-of-the-art LLM-based systems that perform actions (e.g. clicking UI elements, controlling robots, or navigating apps), a core design choice is how to represent actions, observations, and reasoning. This representation serves as both the format for prompting and the structure for training data.
A good protocol must define:
- What the LLM observes (screenshot, goal, state history)
- What the LLM outputs (action, reasoning, completion signal)
- How to serialize and interpret this for execution, replay, logging, and training
This format is foundational to behavioral cloning, few-shot prompting, fine-tuning, and system-level evaluation.
Design Summary
We propose a JSON-first schema with an optional DSL for human input. This schema will replace `LLMActionPlan` and extend `types.py` to enable structured LLM-agent communication.
Core Principles
- JSON First: Canonical format for parsing, validation, logging, and tool-use.
- Unified `AgentAction`: Common action structure with type, target, and parameters.
- Rich Targeting: `track_id` > `element_id` > `text` > `bbox`, resolved by the executor (see the resolution sketch after this list).
- Flexible Parameters: Allow per-action arguments (e.g., text, scroll direction).
- Integrated Reasoning: Provide model reasoning alongside each decision.
- Vision-First Compatibility: Does not rely on structured DOMs; targets are resolved based on visual bounding boxes, OCR, and tracking.
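To make the targeting priority concrete, a minimal resolution sketch follows. It uses the `ActionTarget` model defined in the next section; the `VisualState` lookup helpers (`find_by_track_id`, `find_by_element_id`, `find_by_text`) are hypothetical stand-ins for OmniMCP's actual visual-state queries.
from typing import Optional

def resolve_target(target: ActionTarget, state: "VisualState") -> Optional[Bounds]:
    # Try each targeting field in priority order: track_id > element_id > text > bbox.
    if target.track_id is not None:
        element = state.find_by_track_id(target.track_id)  # hypothetical lookup by persistent ID
        if element is not None:
            return element.bbox
    if target.element_id is not None:
        element = state.find_by_element_id(target.element_id)  # hypothetical per-frame lookup
        if element is not None:
            return element.bbox
    if target.text is not None:
        element = state.find_by_text(target.text)  # hypothetical OCR / text match
        if element is not None:
            return element.bbox
    return target.bbox  # may be None if no targeting field resolves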
Schema (Pydantic-style)
from typing import Any, Dict, Literal, Optional, Tuple

from pydantic import BaseModel, Field

Bounds = Tuple[float, float, float, float]  # (x, y, width, height)

class ActionTarget(BaseModel):
    track_id: Optional[str] = Field(None, description="Persistent tracking ID (preferred)")
    element_id: Optional[int] = Field(None, description="Per-frame UIElement ID")
    text: Optional[str] = Field(None, description="Text content match")
    bbox: Optional[Bounds] = Field(None, description="Normalized [x, y, w, h] box")

class AgentAction(BaseModel):
    action_type: Literal["click", "type", "scroll", "press_key", "wait", "finish_goal", "hover"]
    target: Optional[ActionTarget] = None
    parameters: Dict[str, Any] = Field(default_factory=dict)

class LLMResponse(BaseModel):
    reasoning: str
    action: AgentAction
    is_goal_complete: bool = False
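As an illustration of parsing and validation, a raw LLM completion can be checked directly against this schema. The sketch below assumes Pydantic v2 (`model_validate_json`); with Pydantic v1 the equivalent would be `parse_raw`.
from pydantic import ValidationError

raw_output = '''
{
  "reasoning": "Clicking the tracked login button.",
  "action": {
    "action_type": "click",
    "target": {"track_id": "btn_login_0", "text": "Login"},
    "parameters": {}
  },
  "is_goal_complete": false
}
'''

try:
    response = LLMResponse.model_validate_json(raw_output)  # Pydantic v2 API
    print(response.action.action_type, response.is_goal_complete)
except ValidationError as err:
    # Out-of-schema output is caught here and can be fed back to the model for a retry.
    print(err)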
Examples
Click tracked element:
{
  "reasoning": "Clicking the tracked login button (track_id 'btn_login_0').",
  "action": {
    "action_type": "click",
    "target": {
      "track_id": "btn_login_0",
      "text": "Login"
    },
    "parameters": {}
  },
  "is_goal_complete": false
}
Type by text match:
{
  "reasoning": "Typing username 'testuser' into the 'Username' field.",
  "action": {
    "action_type": "type",
    "target": {
      "text": "Username"
    },
    "parameters": {
      "text_to_type": "testuser"
    }
  },
  "is_goal_complete": false
}
Finish task:
{
  "reasoning": "Login successful, goal complete.",
  "action": {
    "action_type": "finish_goal",
    "target": null,
    "parameters": {}
  },
  "is_goal_complete": true
}
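Because `parameters` is an open dictionary, new arguments can be carried without schema changes (e.g. key combinations or drag-and-drop). The example below is illustrative only; the `key` and `modifiers` parameter names are not fixed by this proposal.
{
  "reasoning": "Copying the selected text with a keyboard shortcut.",
  "action": {
    "action_type": "press_key",
    "target": null,
    "parameters": {
      "key": "c",
      "modifiers": ["cmd"]
    }
  },
  "is_goal_complete": false
}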
Implementation Plan
- Add Models: Integrate `ActionTarget`, `AgentAction`, and `LLMResponse` into `types.py`.
- Deprecate `LLMActionPlan`: Remove the old model or mark it as deprecated.
- Update Planner: Refactor `core.plan_action_for_ui` to emit `LLMResponse`.
- Adapt Executor: Parse `AgentAction`, resolve targets in priority order, and execute actions (see the dispatch sketch after this list).
- Update Logging: Store `LLMResponse` in `LoggedStep` with the full reasoning and action trace.
- Use Schema: Standardize traces, training data, and evaluation around this format.
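For the Adapt Executor step, a minimal dispatch sketch is shown below, building on the `resolve_target` helper sketched earlier. The `InputController` method names (`click`, `type_text`, `scroll`, `hover`, `press_key`) are assumptions for illustration, not the final API.
def execute(action: AgentAction, state: "VisualState", controller: "InputController") -> None:
    # Resolve the target (if any) to a normalized bounding box before acting.
    bbox = resolve_target(action.target, state) if action.target else None

    if action.action_type == "click":
        controller.click(bbox)  # assumed signature
    elif action.action_type == "type":
        controller.type_text(action.parameters.get("text_to_type", ""), bbox)  # assumed signature
    elif action.action_type == "scroll":
        controller.scroll(action.parameters.get("direction", "down"), bbox)  # assumed signature
    elif action.action_type == "hover":
        controller.hover(bbox)  # assumed signature
    elif action.action_type == "press_key":
        controller.press_key(action.parameters)  # assumed signature
    elif action.action_type in ("wait", "finish_goal"):
        pass  # handled by the agent loop (sleep / stop), not the input controller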
Goals
- Enable robust, interpretable communication between LLM planner and executor
- Provide a common representation for collected demonstrations and synthetic data
- Support structured prompting, validation, and traceability
- Serve as a foundation for future fine-tuning and dataset generation
- Interop with external protocols (Operator, LangChain tools, AWL) if needed
Future Work
This schema lays the groundwork for:
- Fine-Tuning for Action Models
  - Use collected traces in `LLMResponse` format as supervised training data
  - Enable instruction-conditioned, multimodal action prediction with vision-language models
- Process Dataset Standardization
  - Store long-horizon user workflows in a common format across applications
  - Use for summarization, labeling, or synthesis
- Multimodal Alignment
  - Align actions with screenshots or element segmentations for grounding
  - Apply to instruction-tuning, RAG, or distillation tasks
- Protocol Interoperability
  - Convert to/from formats used by LangGraph, AWL, Operator (a generic adapter is sketched after this list)
  - Support evaluation under common benchmarks or simulators
- Experimentation Framework
  - Use in `experiments/` to evaluate prompting strategies, output formats, and agent variants (see #28: Design and run controlled comparisons to evaluate prompting, protocol, and architecture hypotheses)
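As a concrete direction for the interoperability item above, an `AgentAction` could be flattened into a generic tool-call dictionary (name plus arguments), which most tool-use protocols consume in some form. This is an illustrative adapter only, not tied to any specific framework's actual API.
def to_tool_call(response: LLMResponse) -> dict:
    # Generic {name, arguments, metadata} shape; each framework would need its own adapter.
    target = response.action.target
    return {
        "name": response.action.action_type,
        "arguments": {
            "target": target.model_dump(exclude_none=True) if target else None,
            **response.action.parameters,
        },
        "metadata": {
            "reasoning": response.reasoning,
            "is_goal_complete": response.is_goal_complete,
        },
    }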
Notes
- While JSON is canonical, a minimal DSL may be introduced later for authoring or direct prompting
- The priority-based targeting logic avoids reliance on any one modality (e.g., DOM, OCR)
- This schema will also enable structured error handling, retry, and verification in future planner loops
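As a sketch of the structured error handling and retry mentioned in the last note, validation failures can be fed back into the prompt for another attempt. `call_llm(prompt) -> str` is a hypothetical stand-in for the planner's model call.
from pydantic import ValidationError

def plan_with_retry(prompt: str, max_attempts: int = 3) -> LLMResponse:
    # Validate model output against the schema; on failure, retry with the error appended.
    last_error = None
    for _ in range(max_attempts):
        full_prompt = prompt if last_error is None else f"{prompt}\n\nYour last output was invalid:\n{last_error}"
        raw = call_llm(full_prompt)  # hypothetical model call
        try:
            return LLMResponse.model_validate_json(raw)
        except ValidationError as err:
            last_error = str(err)
    raise RuntimeError(f"No valid LLMResponse after {max_attempts} attempts")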