🧩 Description
We need to define and implement a minimal but extensible protocol for representing GUI interaction sequences. This protocol will unify the visual state, action metadata, and interaction history into a single structured format—enabling consistent logging, dataset creation, LLM training, planning, and replay.
This format serves as the foundation for downstream systems including the Action Graph (#10), ModelDrivenVisualState, and planner/LLM interfaces.
🧠 Background
OmniMCP currently:
- Captures visual state via OmniParser
- Plans actions using an LLM
- Executes actions via InputController
But there is no standardized, reusable format for representing:
- What was seen
- What was done
- Why it was done (optional)
This protocol fills that gap—similar to what OpenAI Operator, Adept’s AWL, and WebArena’s annotated programs use.
📦 Proposed Data Model (v0.1)
Using `pydantic` for type safety and validation.
```python
from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None  # e.g. pause before typing


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction
```
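Each model can also export a JSON schema for downstream consumers (e.g. LLM function-calling interfaces). A minimal sketch, assuming the pydantic v2 API (`model_json_schema`; v1 uses `.schema()` instead), shown here with `BoundingBox` alone:

```python
from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


# Every model in the protocol can emit its JSON schema the same way.
schema = BoundingBox.model_json_schema()
print(schema["required"])  # ['x1', 'y1', 'x2', 'y2']
```

The same call on `InteractionStep` yields a nested schema covering the full step, which satisfies the "JSON schema export" acceptance criterion below.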
🧪 Examples
```json
{
  "timestamp": 4.1,
  "visual_state": {
    "screenshot_path": "frames/frame_002.png",
    "screen_resolution": [1920, 1080],
    "elements": [
      {
        "element_id": "url_bar",
        "text": "Search or type URL",
        "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120},
        "visible": true
      }
    ],
    "timestamp": 4.1
  },
  "action": {
    "type": "click",
    "target_id": "url_bar",
    "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120}
  }
}
```
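A loader/validator for such log entries can be a thin wrapper over pydantic's parsing. The following is a sketch, assuming pydantic v2 (`model_validate_json` / `model_dump_json`); the models are restated so the snippet runs standalone:

```python
import json
from typing import Literal, Optional

from pydantic import BaseModel


class BoundingBox(BaseModel):
    x1: int
    y1: int
    x2: int
    y2: int


class GUIElement(BaseModel):
    element_id: str
    tag: Optional[str] = None
    text: Optional[str] = None
    role: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    visible: bool = True


class VisualState(BaseModel):
    screenshot_path: str
    screen_resolution: tuple[int, int]
    elements: list[GUIElement]
    timestamp: float


class GUIAction(BaseModel):
    type: Literal["click", "type", "hover", "launch_app", "scroll"]
    target_id: Optional[str] = None
    bbox: Optional[BoundingBox] = None
    text: Optional[str] = None
    delay: Optional[float] = None


class InteractionStep(BaseModel):
    timestamp: float
    visual_state: VisualState
    action: GUIAction


def load_step(raw: str) -> InteractionStep:
    """Validate one logged step; raises pydantic.ValidationError on bad input."""
    return InteractionStep.model_validate_json(raw)


# A log entry shaped like the example above.
raw = json.dumps({
    "timestamp": 4.1,
    "visual_state": {
        "screenshot_path": "frames/frame_002.png",
        "screen_resolution": [1920, 1080],
        "elements": [{
            "element_id": "url_bar",
            "text": "Search or type URL",
            "bbox": {"x1": 120, "y1": 80, "x2": 800, "y2": 120},
            "visible": True,
        }],
        "timestamp": 4.1,
    },
    "action": {"type": "click", "target_id": "url_bar"},
})
step = load_step(raw)
assert step.action.type == "click"
assert step.visual_state.elements[0].bbox.x2 == 800

# Round-trip: serialize and re-parse, checking equality of the typed models.
assert load_step(step.model_dump_json()) == step

# Pretty-print for log inspection.
print(step.model_dump_json(indent=2))
```

This same pattern covers the round-trip I/O unit tests in the acceptance criteria: serialize with `model_dump_json`, re-parse, and compare models.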
✅ Acceptance Criteria
- Protocol spec exists as Python `pydantic` models with JSON schema export
- Example logs (real or synthetic) stored in a versioned `protocol/` directory
- Validator for loading, validating, and pretty-printing logs
- Unit tests for schema validity and round-trip I/O
- Integration into the `AgentExecutor` logging pipeline (optional, stub OK)
📚 References
- [Operator JSON schema examples](https://platform.openai.com/docs/guides/function-calling)
- [WebArena annotated programs](https://github.com/web-arena/WebArena)
- [Adept’s AWL DSL](https://www.adept.ai/blog/act-1)
- [MiniWoB++ trajectories](https://github.com/google/miniwob-plusplus)
- OmniMCP Action Graph issue: "Feature: Generate Action Graph from Interaction Log + Scene Snapshots" (#10)
📌 Priority
High. This is foundational to planning, replay, dataset creation, and eventual fine-tuning. Enables reuse of traces across components and simplifies future evaluation and debugging.