Design LLM-Friendly DSL or JSON Schema for Declarative GUI Actions #26

Open
@abrichr

Description


This issue proposes a declarative, LLM-friendly schema for GUI actions. The goal is to define a structured output format that LLMs can reliably generate and that execution systems can interpret unambiguously. This schema will serve as the canonical interface between LLM-generated plans and OmniMCP's InputController, and will also be used in dataset generation, evaluation, and training workflows.

The format must support:

  • Common GUI actions (click, type, scroll, hover, launch, wait, etc.)
  • Multiple targeting strategies (track ID, element ID, text, bounding box)
  • Extensibility for new actions and parameters (e.g. drag-and-drop, key combos)
  • Easy validation and parsing
  • Compatibility with prompt-based generation and structured output validation

Background

In state-of-the-art LLM-based systems that perform actions (e.g. clicking UI elements, controlling robots, or navigating apps), a core design choice is how to represent actions, observations, and reasoning. This representation serves as both the format for prompting and the structure for training data.

A good protocol must define:

  • What the LLM observes (screenshot, goal, state history)
  • What the LLM outputs (action, reasoning, completion signal)
  • How to serialize and interpret this for execution, replay, logging, and training

This format is foundational to behavioral cloning, few-shot prompting, fine-tuning, and system-level evaluation.


Design Summary

We propose a JSON-first schema with an optional DSL for human input. This schema will replace LLMActionPlan and extend types.py to enable structured LLM-agent communication.

Core Principles

  • JSON First: Canonical format for parsing, validation, logging, and tool-use.
  • Unified AgentAction: Common action structure with type, target, and parameters.
  • Rich Targeting: track_id > element_id > text > bbox, resolved by the executor in that priority order (see the sketch after this list).
  • Flexible Parameters: Allow per-action arguments (e.g., text, scroll direction).
  • Integrated Reasoning: Provide model reasoning alongside each decision.
  • Vision-First Compatibility: Does not rely on structured DOMs; targets are resolved based on visual bounding boxes, OCR, and tracking.
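
A minimal sketch of the priority-based target resolution the executor might implement, using the ActionTarget model defined in the next section. Frame, UIElement, and the find_by_* helpers are hypothetical names for illustration, not existing OmniMCP APIs:

from typing import Optional

def resolve_target(target: ActionTarget, frame: "Frame") -> Optional["UIElement"]:
    """Resolve an ActionTarget against the current frame, in priority order."""
    if target.track_id is not None:
        element = frame.find_by_track_id(target.track_id)  # hypothetical lookup
        if element is not None:
            return element
    if target.element_id is not None:
        element = frame.find_by_element_id(target.element_id)
        if element is not None:
            return element
    if target.text is not None:
        element = frame.find_by_text(target.text)  # e.g. OCR / text-content match
        if element is not None:
            return element
    if target.bbox is not None:
        return frame.find_by_bbox(target.bbox)  # e.g. nearest element to the box
    return None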

Schema (Pydantic-style)

from pydantic import BaseModel, Field
from typing import Any, Dict, Literal, Optional, Tuple

Bounds = Tuple[float, float, float, float]  # (x, y, width, height)

class ActionTarget(BaseModel):
    track_id: Optional[str] = Field(None, description="Persistent tracking ID (preferred)")
    element_id: Optional[int] = Field(None, description="Per-frame UIElement ID")
    text: Optional[str] = Field(None, description="Text content match")
    bbox: Optional[Bounds] = Field(None, description="Normalized [x, y, w, h] box")

class AgentAction(BaseModel):
    action_type: Literal["click", "type", "scroll", "press_key", "wait", "finish_goal", "hover"]
    target: Optional[ActionTarget] = None
    parameters: Dict[str, Any] = Field(default_factory=dict)

class LLMResponse(BaseModel):
    reasoning: str
    action: AgentAction
    is_goal_complete: bool = False
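
Because the schema is plain Pydantic, raw LLM output can be validated and parsed in one call. A short usage sketch, assuming Pydantic v2 (on v1, parse_raw is the equivalent):

import json

raw = json.dumps({
    "reasoning": "Clicking the tracked login button.",
    "action": {
        "action_type": "click",
        "target": {"track_id": "btn_login_0", "text": "Login"},
        "parameters": {},
    },
    "is_goal_complete": False,
})

response = LLMResponse.model_validate_json(raw)  # raises ValidationError if malformed
assert response.action.action_type == "click"
assert response.action.target.track_id == "btn_login_0"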

Examples

Click tracked element:

{
  "reasoning": "Clicking the tracked login button (track_id 'btn_login_0').",
  "action": {
    "action_type": "click",
    "target": {
      "track_id": "btn_login_0",
      "text": "Login"
    },
    "parameters": {}
  },
  "is_goal_complete": false
}

Type by text match:

{
  "reasoning": "Typing username 'testuser' into the 'Username' field.",
  "action": {
    "action_type": "type",
    "target": {
      "text": "Username"
    },
    "parameters": {
      "text_to_type": "testuser"
    }
  },
  "is_goal_complete": false
}

Finish task:

{
  "reasoning": "Login successful, goal complete.",
  "action": {
    "action_type": "finish_goal",
    "target": null,
    "parameters": {}
  },
  "is_goal_complete": true
}

Implementation Plan

  1. Add Models: Integrate ActionTarget, AgentAction, LLMResponse into types.py.
  2. Deprecate LLMActionPlan: Remove the old model or mark it as deprecated.
  3. Update Planner: Refactor core.plan_action_for_ui to emit LLMResponse.
  4. Adapt Executor: Parse AgentAction, resolve targets in priority order, and execute actions (a dispatch sketch follows this list).
  5. Update Logging: Store LLMResponse in LoggedStep with full reasoning and action trace.
  6. Use Schema: Standardize traces, training data, and evaluation around this format.
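
A sketch of how step 4 might look, reusing resolve_target from the Core Principles sketch above. The InputController method names (click_at, type_text, scroll, press_key, move_to) are assumptions for illustration, not the existing API:

import time

def execute(action: AgentAction, frame: "Frame", controller: "InputController") -> None:
    """Resolve the action's target (if any), then dispatch on action_type."""
    element = resolve_target(action.target, frame) if action.target else None
    if action.action_type in ("click", "hover") and element is None:
        raise LookupError(f"Could not resolve target: {action.target}")

    if action.action_type == "click":
        controller.click_at(element.bbox)
    elif action.action_type == "hover":
        controller.move_to(element.bbox)
    elif action.action_type == "type":
        if element is not None:
            controller.click_at(element.bbox)  # focus the field before typing
        controller.type_text(action.parameters["text_to_type"])
    elif action.action_type == "scroll":
        controller.scroll(action.parameters.get("direction", "down"))
    elif action.action_type == "press_key":
        controller.press_key(action.parameters["key"])
    elif action.action_type == "wait":
        time.sleep(action.parameters.get("seconds", 1.0))
    elif action.action_type == "finish_goal":
        pass  # terminal signal; nothing to execute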

Goals

  • Enable robust, interpretable communication between LLM planner and executor
  • Provide a common representation for collected demonstrations and synthetic data
  • Support structured prompting, validation, and traceability
  • Serve as a foundation for future fine-tuning and dataset generation
  • Interop with external protocols (Operator, LangChain tools, AWL) if needed

Future Work

This schema lays the groundwork for:

  1. Fine-Tuning for Action Models

    • Use collected traces in LLMResponse format as supervised training data
    • Enable instruction-conditioned, multimodal action prediction with vision-language models
  2. Process Dataset Standardization

    • Store long-horizon user workflows in a common format across applications
    • Use for summarization, labeling, or synthesis
  3. Multimodal Alignment

    • Align actions with screenshots or element segmentations for grounding
    • Apply to instruction-tuning, RAG, or distillation tasks
  4. Protocol Interoperability

    • Convert to/from formats used by LangGraph, AWL, Operator
    • Support evaluation under common benchmarks or simulators
  5. Experimentation Framework

Notes

  • While JSON is canonical, a minimal DSL may be introduced later for authoring or direct prompting (one possible shape is sketched below)
  • The priority-based targeting logic avoids reliance on any one modality (e.g., DOM, OCR)
  • This schema will also enable structured error handling, retry, and verification in future planner loops
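
For illustration only, one possible shape for such a DSL, with a trivial translator into the canonical models. The grammar here is entirely hypothetical:

import shlex

def parse_dsl(line: str) -> AgentAction:
    """Hypothetical grammar: 'click @<track_id>', 'type <field_text> <text>', 'finish'.

    JSON stays canonical; this is authoring sugar on top of it.
    """
    tokens = shlex.split(line)
    verb = tokens[0]
    if verb == "click":
        return AgentAction(action_type="click",
                           target=ActionTarget(track_id=tokens[1].lstrip("@")))
    if verb == "type":
        return AgentAction(action_type="type",
                           target=ActionTarget(text=tokens[1]),
                           parameters={"text_to_type": tokens[2]})
    if verb == "finish":
        return AgentAction(action_type="finish_goal")
    raise ValueError(f"Unknown DSL verb: {verb}")

parse_dsl('type "Username" "testuser"')  # -> AgentAction(action_type='type', ...)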
