Description
Building on the foundational OmniParser integration (Phases 1-3 in #2), this roadmap outlines the development of OmniMCP's spatial-temporal synthesis capabilities and high-impact demonstrations. The focus is on efficiently delivering a compelling, usable framework with minimal complexity.
Phase 4: Core Process Graph & UI Understanding
Priority: High - Core differentiation
Dependencies: Completed Phase 3
4.1 Unified Data Source Architecture [Complexity: M]
- Create abstraction layer that works with both real and synthetic UI data
- Implement consistent interface for screen state acquisition and input control
- Support transparent switching between testing/production environments
- Add configuration-driven source selection with proper logging
This architecture creates a clean boundary between real and synthetic implementations:
class DataSourceManager:
    """Manages access to UI state and control with consistent interface."""

    def get_screen_state(self) -> ScreenState:
        """Get current UI state from configured source."""

    def perform_action(self, action: Action) -> ActionResult:
        """Execute action using configured input method."""
4.2 Process Graph Framework [Complexity: M]
- Implement data structures for representing UI interaction sequences
- Create process graph generation from synthetic UI scenarios
- Add state transition modeling and verification
- Implement graph visualization for debugging/demonstration
The process graph captures temporal patterns in UI interactions:
class ProcessGraph:
    """Represents UI automation sequences as directed graphs."""

    def add_state(self, screen_state: ScreenState) -> str:
        """Add state to the graph and return state ID."""

    def add_transition(self, source_id: str, target_id: str, action: Action) -> None:
        """Add transition between states based on action."""

    def find_similar_states(self, screen_state: ScreenState) -> List[SimilarState]:
        """Find states similar to given screen state."""

    def suggest_next_actions(self, current_state: ScreenState) -> List[ActionSuggestion]:
        """Suggest possible next actions based on graph."""
4.3 Core Omni API [Complexity: M]
- Implement fluent Omni API as specified in usage pattern
- Add session state management for multi-step operations
- Create natural language action execution
- Support conditional execution based on UI state
The core API follows the specified pattern for both programmatic and LLM use:
from omnimcp import Omni

omni = Omni()
with omni.session():
    email = omni.recall("credentials.email")
    if omni.is_("Login form ready"):
        omni.do(f"Enter {email}")
        omni.do("Submit login")
        omni.observe("latest transaction date").store("user.last_transaction_date")
This intuitive API encapsulates the sophisticated spatial-temporal understanding while providing a simple interface for automation:
class Omni:
    """Main entry point for OmniMCP framework."""

    def session(self):
        """Create a new interaction session context manager."""

    def recall(self, key: str) -> Any:
        """Recall value from session memory."""

    def is_(self, state_description: str) -> bool:
        """Check if current UI matches description."""
        # Trailing underscore avoids shadowing the Python keyword `is`

    def do(self, action_description: str) -> ActionResult:
        """Perform described action on UI."""

    def observe(self, target_description: str) -> ObservationResult:
        """Extract information from current UI state."""
Phase 5: MCP Protocol Implementation
Priority: High - API Stability
Dependencies: Completed Phase 4
5.1 MCP Core Protocol [Complexity: M]
- Implement streamlined MCP protocol for LLM interaction
- Create efficient JSON serialization/deserialization
- Add proper error handling and response types
- Support context window management
The MCP protocol provides LLMs with access to the same Omni API:
# Core MCP functions that map directly to the Omni API

@mcp.tool()
async def recall(key: str) -> Any:
    """Recall value from session memory."""

@mcp.tool()
async def is_(state_description: str) -> bool:
    """Check if current UI matches description."""

@mcp.tool()
async def do(action_description: str) -> ActionResult:
    """Perform described action on UI."""

@mcp.tool()
async def observe(target_description: str) -> ObservationResult:
    """Extract information from current UI state."""

@mcp.tool()
async def store(key: str, value: Any) -> StoreResult:
    """Store value in session memory."""
5.2 Process Understanding [Complexity: M]
- Implement UI state analysis for semantic understanding
- Add temporal pattern recognition from interactions
- Create context-aware element targeting
- Support spatial relationship understanding
These capabilities power the natural language interface:
class UIStateAnalyzer:
    """Analyzes UI state for semantic understanding."""

    def match_state_description(self, state_description: str) -> float:
        """Check how well current state matches description."""

    def find_elements(self, element_description: str) -> List[UIElement]:
        """Find elements matching description."""

    def extract_information(self, target_description: str) -> Any:
        """Extract described information from UI."""
5.3 Action Planning [Complexity: M]
- Implement multi-step action planning
- Add validation of proposed action sequences
- Create result verification for completed actions
- Support retry strategies for failed actions
This functionality enables goal-oriented automation:
@mcp.tool()
async def plan(goal_description: str) -> ActionPlan:
    """Create plan to achieve described goal on current UI."""

@mcp.tool()
async def execute_plan(plan: ActionPlan) -> PlanResult:
    """Execute action plan with validation steps."""
Phase 6: Demonstration & Documentation
Priority: Critical - Shows value
Dependencies: Completed Phase 5
6.1 Real-World Scenario Implementation [Complexity: M]
- Implement common UI automation scenarios (login, form filling, etc.)
- Create comprehensive examples showing framework capabilities
- Add detailed logging and visualization of execution
- Support both scripted and interactive demonstration
Example scenarios will showcase the framework in action:
def demonstrate_login_workflow():
    """Demonstrate login workflow with validation."""
    omni = Omni()
    with omni.session():
        # Store credentials in session memory
        omni.store("credentials.username", "testuser")
        omni.store("credentials.password", "password123")

        # Check if we're on the login page
        if omni.is_("login page"):
            # Enter credentials
            username = omni.recall("credentials.username")
            password = omni.recall("credentials.password")
            omni.do(f"Enter {username} in username field")
            omni.do(f"Enter {password} in password field")
            omni.do("Click login button")

            # Wait for dashboard to load
            omni.wait_for("dashboard page")

            # Extract and store account information
            omni.observe("account number").store("user.account_number")

            # Check balance
            balance = omni.observe("current balance")
            print(f"Current balance: {balance}")
        else:
            print("Not on login page")
6.2 Advanced Interaction Support [Complexity: M]
- Add support for more complex interactions (drag-and-drop, etc.)
- Implement handling for dynamic content and state changes
- Create strategies for error recovery
- Support batch operations across multiple UI elements
These capabilities will extend the core API:
# Extended Omni methods for advanced interactions

def wait_for(self, state_description: str, timeout: float = 10.0) -> bool:
    """Wait for UI to reach described state."""

def retry(self, action_description: str, max_attempts: int = 3) -> ActionResult:
    """Retry action until success or max attempts reached."""

def drag(self, source_desc: str, target_desc: str) -> ActionResult:
    """Drag element to target location."""
6.3 Documentation & Examples [Complexity: S]
- Create comprehensive API reference
- Develop clear usage guides with practical examples
- Document architecture and extension points
- Include troubleshooting and best practices
Implementation Strategy
The implementation will focus on simplicity and effectiveness:
- API-First Design: Center development around the intuitive Omni API
- Minimal Dependencies: Keep external dependencies to a minimum
- Test-Driven: Build comprehensive test suite to validate functionality
- Progressive Complexity: Start with core capabilities, then add features
- Real-World Testing: Continuously validate against real applications
Data Source Architecture
The data source abstraction provides a clean interface for both real and synthetic data:
# Factory method creates appropriate implementation based on configuration
def create_data_manager(config: Config) -> DataSourceManager:
    """Create data source manager based on configuration."""
    if config.USE_SYNTHETIC:
        return SyntheticDataManager(config)
    else:
        return RealDataManager(config)

# Real implementation uses actual screen capture and input
class RealDataManager(DataSourceManager):
    """Manages real UI interaction using mss and pynput."""

    def get_screen_state(self) -> ScreenState:
        """Capture real screen state using mss."""
        # Implementation uses mss for screen capture

    def perform_action(self, action: Action) -> ActionResult:
        """Execute action using pynput."""
        # Implementation uses pynput for input control

# Synthetic implementation for testing
class SyntheticDataManager(DataSourceManager):
    """Manages synthetic UI for testing."""

    def get_screen_state(self) -> ScreenState:
        """Generate synthetic screen state."""
        # Implementation creates synthetic UI representation

    def perform_action(self, action: Action) -> ActionResult:
        """Simulate action on synthetic UI."""
        # Implementation simulates action effects
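To make the real implementation concrete, here is a minimal capture/input pair using the actual mss and pynput APIs; the ScreenState and Action field names are assumptions for illustration:

import mss
from pynput.keyboard import Controller as KeyboardController
from pynput.mouse import Button, Controller as MouseController

class RealDataManager(DataSourceManager):
    def __init__(self, config: Config):
        self._sct = mss.mss()
        self._mouse = MouseController()
        self._keyboard = KeyboardController()

    def get_screen_state(self) -> ScreenState:
        """Grab the primary monitor as raw RGB bytes."""
        shot = self._sct.grab(self._sct.monitors[1])  # monitors[1] is the primary display
        return ScreenState(width=shot.width, height=shot.height,
                           pixels=shot.rgb)  # assumed ScreenState fields

    def perform_action(self, action: Action) -> ActionResult:
        """Dispatch clicks and typing via pynput."""
        if action.type == "click":  # assumed Action fields
            self._mouse.position = (action.x, action.y)
            self._mouse.click(Button.left, 1)
        elif action.type == "type":
            self._keyboard.type(action.text)
        return ActionResult(success=True)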
Process Graph Implementation
The process graph will be the central data structure for temporal understanding:
class ProcessGraphNode:
    """Represents a state in the process graph."""

    def __init__(self, state_id: str, screen_state: ScreenState):
        self.state_id = state_id
        self.screen_state = screen_state
        self.metadata = {}  # Additional state information

class ProcessGraphEdge:
    """Represents a transition between states in the process graph."""

    def __init__(self, source_id: str, target_id: str, action: Action):
        self.source_id = source_id
        self.target_id = target_id
        self.action = action
        self.success_rate = 1.0  # Initial perfect success rate
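Wiring these classes into the ProcessGraph container might look like the following; the fingerprint-based state ID, used to deduplicate revisited screens, relies on an assumed ScreenState.fingerprint() method:

import hashlib

class ProcessGraph:
    def __init__(self):
        self.nodes: dict[str, ProcessGraphNode] = {}
        self.edges: list[ProcessGraphEdge] = []

    def add_state(self, screen_state: ScreenState) -> str:
        """Deduplicate states by fingerprint and return the state ID."""
        digest = hashlib.sha1(screen_state.fingerprint().encode()).hexdigest()
        state_id = digest[:12]
        if state_id not in self.nodes:
            self.nodes[state_id] = ProcessGraphNode(state_id, screen_state)
        return state_id

    def add_transition(self, source_id: str, target_id: str, action: Action) -> None:
        """Record an observed transition between two known states."""
        self.edges.append(ProcessGraphEdge(source_id, target_id, action))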
Next Steps
After completing Phases 1-3:
- Implement the DataSourceManager abstraction
- Create basic ProcessGraph implementation
- Implement the core Omni API as specified
- Build minimal MCP protocol implementation
- Develop demonstration scenarios
This streamlined approach focuses on creating a usable framework with minimal complexity while maintaining the powerful spatial-temporal understanding capabilities that differentiate OmniMCP.