Define Version 0.1 Protocol for GUI Interaction State and Action Sequences #25

Open
@abrichr

🧩 Description

We need to define and implement a minimal but extensible protocol for representing GUI interaction sequences. This protocol will unify the visual state, action metadata, and interaction history into a single structured format—enabling consistent logging, dataset creation, LLM training, planning, and replay.

This format serves as the foundation for downstream systems including the Action Graph (#10), ModelDrivenVisualState, and planner/LLM interfaces.


🧠 Background

OmniMCP currently:

  • Captures visual state via OmniParser
  • Plans actions using an LLM
  • Executes actions via InputController

But there is no standardized, reusable format for representing:

  • What was seen
  • What was done
  • Why it was done (optional)

This protocol fills that gap—similar to what OpenAI Operator, Adept’s AWL, and WebArena’s annotated programs use.


📦 Proposed Data Model (v0.1)

Using pydantic for type safety and validation.

from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    # Axis-aligned rectangle given by two corner points.
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    # One parsed UI element; only element_id is required.
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    # What was seen: a screenshot plus the elements parsed from it.
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    # What was done.
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None  # e.g. before typing


class InteractionStep(BaseModel):
    # One (state, action) pair in an interaction sequence.
    timestamp: float
    visual_state: VisualState
    action: GUIAction
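
To illustrate the "type safety and validation" point, here is a minimal sketch of how pydantic might accept a well-formed action and reject one whose `type` is not in the `Literal`. Only the two models needed are repeated; the `"double_click"` value and the `rejected` flag are illustrative and not part of the proposal.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


# A well-formed action: the nested dict is coerced into a BoundingBox.
click = GUIAction(
    type="click",
    target_id="url_bar",
    bbox={"x1": 120, "y1": 80, "x2": 800, "y2": 120},
)

# An action type outside the Literal is rejected at construction time.
try:
    GUIAction(type="double_click", target_id="url_bar")
    rejected = False
except ValidationError:
    rejected = True
```

Because unsupported action types fail at construction, a logger built on these models would presumably never write an invalid step to disk.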

🧪 Examples

{
  "timestamp": 4.1,
  "visual_state": {
    "screenshot_path": "frames/frame_002.png",
    "screen_resolution": [1920, 1080],
    "elements": [
      {
        "element_id": "url_bar",
        "text": "Search or type URL",
        "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120},
        "visible": true
      }
    ],
    "timestamp": 4.1
  },
  "action": {
    "type": "click",
    "target_id": "url_bar",
    "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120}
  }
}
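
Since replay is one of the intended uses, the sketch below shows how a consumer might turn a logged step into a screen coordinate using only the standard library. `resolve_click_point` and `bbox_corners` are hypothetical helpers, not part of OmniMCP. The sketch assumes a bbox holds two opposite corners in pixels, and it accepts either a four-number list or an `x1`/`y1`/`x2`/`y2` mapping.

```python
import json

# A trimmed step with only the fields this sketch reads.
step_json = """
{
  "timestamp": 4.1,
  "action": {"type": "click", "target_id": "url_bar"},
  "visual_state": {
    "elements": [
      {"element_id": "url_bar",
       "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120}}
    ]
  }
}
"""


def bbox_corners(bbox):
    """Return (x1, y1, x2, y2) from a mapping or a four-number sequence."""
    if isinstance(bbox, dict):
        return bbox["x1"], bbox["y1"], bbox["x2"], bbox["y2"]
    x1, y1, x2, y2 = bbox
    return x1, y1, x2, y2


def resolve_click_point(step):
    """Hypothetical helper: choose the pixel a replayer would click."""
    action = step["action"]
    bbox = action.get("bbox")
    if bbox is None:
        # Fall back to the bbox of the element named by target_id.
        elements = {e["element_id"]: e for e in step["visual_state"]["elements"]}
        target = elements.get(action.get("target_id"))
        bbox = target.get("bbox") if target else None
    if bbox is None:
        return None  # Nothing to aim at.
    x1, y1, x2, y2 = bbox_corners(bbox)
    return ((x1 + x2) // 2, (y1 + y2) // 2)


point = resolve_click_point(json.loads(step_json))
```

Clicking the bbox centre is only one possible policy; a real replayer might prefer a recorded cursor position if the protocol later adds one.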

✅ Acceptance Criteria

  • Protocol spec exists as Python pydantic models with JSON schema export
  • Example logs (real or synthetic) stored in versioned protocol/ directory
  • Validator for loading, validating, and pretty-printing logs
  • Unit tests for schema validity and round-trip I/O
  • Integration into AgentExecutor logging pipeline (optional, stub OK)
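
The schema-export and round-trip criteria could be exercised roughly as follows. This is a sketch that assumes pydantic v2 (`model_json_schema`, `model_dump_json`, `model_validate_json`); the `round_trip` helper and the sample values are illustrative only.

```python
from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction


def round_trip(step: InteractionStep) -> InteractionStep:
    """Serialize a step to JSON and parse it back (assumes pydantic v2)."""
    return InteractionStep.model_validate_json(step.model_dump_json())


box = BoundingBox(x1=120, y1=80, x2=800, y2=120)
step = InteractionStep(
    timestamp=4.1,
    visual_state=VisualState(
        screenshot_path="frames/frame_002.png",
        screen_resolution=(1920, 1080),
        elements=[GUIElement(element_id="url_bar", bbox=box)],
        timestamp=4.1,
    ),
    action=GUIAction(type="click", target_id="url_bar", bbox=box),
)

schema = InteractionStep.model_json_schema()  # JSON Schema as a plain dict
restored = round_trip(step)
pretty = restored.model_dump_json(indent=2)   # pretty-printed log entry
```

A unit test along these lines would mainly assert that `restored == step` and that the exported schema lists the expected top-level properties.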

📌 Priority

High. This is foundational to planning, replay, dataset creation, and eventual fine-tuning. Enables reuse of traces across components and simplifies future evaluation and debugging.
