OmniMCP Phase 4-6: Spatial-Temporal UI Understanding & Demonstration #3


After implementing the foundational OmniParser integration (Phases 1-3 in #2), this roadmap outlines the development of OmniMCP's spatial-temporal synthesis capabilities and high-impact demonstrations. The focus is on efficiently creating a compelling, usable framework with minimal complexity.

Phase 4: Core Process Graph & UI Understanding

Priority: High - Core differentiation
Dependencies: Completed Phase 3

4.1 Unified Data Source Architecture [Complexity: M]

- Create abstraction layer that works with both real and synthetic UI data
- Implement consistent interface for screen state acquisition and input control
- Support transparent switching between testing/production environments
- Add configuration-driven source selection with proper logging

This architecture creates a clean boundary between real and synthetic implementations:

class DataSourceManager:
    """Manages access to UI state and control with consistent interface."""
    
    def get_screen_state(self) -> ScreenState:
        """Get current UI state from configured source."""
        
    def perform_action(self, action: Action) -> ActionResult:
        """Execute action using configured input method."""

4.2 Process Graph Framework [Complexity: M]

- Implement data structures for representing UI interaction sequences
- Create process graph generation from synthetic UI scenarios
- Add state transition modeling and verification
- Implement graph visualization for debugging/demonstration

The process graph captures temporal patterns in UI interactions:

class ProcessGraph:
    """Represents UI automation sequences as directed graphs."""
    
    def add_state(self, screen_state: ScreenState) -> str:
        """Add state to the graph and return state ID."""
        
    def add_transition(self, source_id: str, target_id: str, action: Action) -> None:
        """Add transition between states based on action."""
        
    def find_similar_states(self, screen_state: ScreenState) -> List[SimilarState]:
        """Find states similar to given screen state."""
        
    def suggest_next_actions(self, current_state: ScreenState) -> List[ActionSuggestion]:
        """Suggest possible next actions based on graph."""

4.3 Core Omni API [Complexity: M]

- Implement fluent Omni API as specified in usage pattern
- Add session state management for multi-step operations
- Create natural language action execution
- Support conditional execution based on UI state

The core API follows the specified pattern for both programmatic and LLM use:

from omnimcp import Omni

omni = Omni()
with omni.session():
    email = omni.recall("credentials.email")

    if omni.is_("Login form ready"):  # is_ because `is` is a reserved word in Python
        omni.do(f"Enter {email}")
        omni.do("Submit login")

    omni.observe("latest transaction date").store("user.last_transaction_date")

This API encapsulates the spatial-temporal understanding machinery behind a simple automation interface:

class Omni:
    """Main entry point for OmniMCP framework."""

    def session(self):
        """Create a new interaction session context manager."""

    def recall(self, key: str) -> Any:
        """Recall value from session memory."""

    def store(self, key: str, value: Any) -> StoreResult:
        """Store value in session memory."""

    def is_(self, state_description: str) -> bool:
        """Check if current UI matches description."""

    def do(self, action_description: str) -> ActionResult:
        """Perform described action on UI."""

    def observe(self, target_description: str) -> ObservationResult:
        """Extract information from current UI state."""

Phase 5: MCP Protocol Implementation

Priority: High - API Stability
Dependencies: Completed Phase 4

5.1 MCP Core Protocol [Complexity: M]

- Implement streamlined MCP protocol for LLM interaction
- Create efficient JSON serialization/deserialization
- Add proper error handling and response types
- Support context window management

The MCP protocol provides LLMs with access to the same Omni API:

# Core MCP functions that map directly to the Omni API
@mcp.tool()
async def recall(key: str) -> Any:
    """Recall value from session memory."""

@mcp.tool()
async def is_(state_description: str) -> bool:
    """Check if current UI matches description."""

@mcp.tool()
async def do(action_description: str) -> ActionResult:
    """Perform described action on UI."""

@mcp.tool()
async def observe(target_description: str) -> ObservationResult:
    """Extract information from current UI state."""

@mcp.tool()
async def store(key: str, value: Any) -> StoreResult:
    """Store value in session memory."""

5.2 Process Understanding [Complexity: M]

- Implement UI state analysis for semantic understanding
- Add temporal pattern recognition from interactions
- Create context-aware element targeting
- Support spatial relationship understanding

These capabilities power the natural language interface:

class UIStateAnalyzer:
    """Analyzes UI state for semantic understanding."""
    
    def match_state_description(self, state_description: str) -> float:
        """Check how well current state matches description."""
        
    def find_elements(self, element_description: str) -> List[UIElement]:
        """Find elements matching description."""
        
    def extract_information(self, target_description: str) -> Any:
        """Extract described information from UI."""

5.3 Action Planning [Complexity: M]

- Implement multi-step action planning
- Add validation of proposed action sequences
- Create result verification for completed actions
- Support retry strategies for failed actions

This functionality enables goal-oriented automation:

@mcp.tool()
async def plan(goal_description: str) -> ActionPlan:
    """Create plan to achieve described goal on current UI."""

@mcp.tool()
async def execute_plan(plan: ActionPlan) -> PlanResult:
    """Execute action plan with validation steps."""

Phase 6: Demonstration & Documentation

Priority: Critical - Shows value
Dependencies: Completed Phase 5

6.1 Real-World Scenario Implementation [Complexity: M]

- Implement common UI automation scenarios (login, form filling, etc.)
- Create comprehensive examples showing framework capabilities
- Add detailed logging and visualization of execution
- Support both scripted and interactive demonstration

Example scenarios will showcase the framework in action:

def demonstrate_login_workflow():
    """Demonstrate login workflow with validation."""
    
    omni = Omni()
    with omni.session():
        # Store credentials in session memory
        omni.store("credentials.username", "testuser")
        omni.store("credentials.password", "password123")
        
        # Check if we're on login page
        if omni.is("login page"):
            # Enter credentials
            username = omni.recall("credentials.username")
            password = omni.recall("credentials.password")
            
            omni.do(f"Enter {username} in username field")
            omni.do(f"Enter {password} in password field")
            omni.do("Click login button")
            
            # Wait for dashboard to load
            omni.wait_for("dashboard page")
            
            # Extract and store account information
            omni.observe("account number").store("user.account_number")
            
            # Check balance
            balance = omni.observe("current balance")
            print(f"Current balance: {balance}")
        else:
            print("Not on login page")

6.2 Advanced Interaction Support [Complexity: M]

- Add support for more complex interactions (drag-and-drop, etc.)
- Implement handling for dynamic content and state changes
- Create strategies for error recovery
- Support batch operations across multiple UI elements

These capabilities will extend the core API:

# Extended Omni methods for advanced interactions
def wait_for(self, state_description: str, timeout: float = 10.0) -> bool:
    """Wait for UI to reach described state."""
    
def retry(self, action_description: str, max_attempts: int = 3) -> ActionResult:
    """Retry action until success or max attempts reached."""
    
def drag(self, source_desc: str, target_desc: str) -> ActionResult:
    """Drag element to target location."""

6.3 Documentation & Examples [Complexity: S]

- Create comprehensive API reference
- Develop clear usage guides with practical examples
- Document architecture and extension points
- Include troubleshooting and best practices

Implementation Strategy

The implementation will focus on simplicity and effectiveness:

  1. API-First Design: Center development around the intuitive Omni API
  2. Minimal Dependencies: Keep external dependencies to a minimum
  3. Test-Driven: Build comprehensive test suite to validate functionality
  4. Progressive Complexity: Start with core capabilities, then add features
  5. Real-World Testing: Continuously validate against real applications

Data Source Architecture

The data source abstraction provides a clean interface for both real and synthetic data:

# Factory method creates appropriate implementation based on configuration
def create_data_manager(config: Config) -> DataSourceManager:
    """Create data source manager based on configuration."""
    if config.USE_SYNTHETIC:
        return SyntheticDataManager(config)
    else:
        return RealDataManager(config)

# Real implementation uses actual screen capture and input
class RealDataManager(DataSourceManager):
    """Manages real UI interaction using mss and pynput."""
    
    def get_screen_state(self) -> ScreenState:
        """Capture real screen state using mss."""
        # Implementation uses mss for screen capture
        
    def perform_action(self, action: Action) -> ActionResult:
        """Execute action using pynput."""
        # Implementation uses pynput for input control

# Synthetic implementation for testing
class SyntheticDataManager(DataSourceManager):
    """Manages synthetic UI for testing."""
    
    def get_screen_state(self) -> ScreenState:
        """Generate synthetic screen state."""
        # Implementation creates synthetic UI representation
        
    def perform_action(self, action: Action) -> ActionResult:
        """Simulate action on synthetic UI."""
        # Implementation simulates action effects
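
A minimal Config consistent with the factory above; the exact field set is an assumption:

from dataclasses import dataclass

@dataclass
class Config:
    USE_SYNTHETIC: bool = False  # select SyntheticDataManager when True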

Process Graph Implementation

The process graph will be the central data structure for temporal understanding:

class ProcessGraphNode:
    """Represents a state in the process graph."""
    
    def __init__(self, state_id: str, screen_state: ScreenState):
        self.state_id = state_id
        self.screen_state = screen_state
        self.metadata = {}  # Additional state information

class ProcessGraphEdge:
    """Represents a transition between states in the process graph."""
    
    def __init__(self, 
                source_id: str, 
                target_id: str, 
                action: Action):
        self.source_id = source_id
        self.target_id = target_id
        self.action = action
        self.success_rate = 1.0  # optimistic initial estimate, refined as executions are observed
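
Edges could refine their success estimate as executions are observed; below is a sketch of a method that might be added to ProcessGraphEdge, using an exponential moving average (the update rule and smoothing factor are assumptions):

    def record_result(self, success: bool, alpha: float = 0.2) -> None:
        """Blend the latest outcome into the running success rate."""
        outcome = 1.0 if success else 0.0
        self.success_rate = (1 - alpha) * self.success_rate + alpha * outcome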

Next Steps

After completing Phases 1-3:

  1. Implement the DataSourceManager abstraction
  2. Create basic ProcessGraph implementation
  3. Implement the core Omni API as specified
  4. Build minimal MCP protocol implementation
  5. Develop demonstration scenarios

This streamlined approach focuses on creating a usable framework with minimal complexity while maintaining the powerful spatial-temporal understanding capabilities that differentiate OmniMCP.
