OmniMCP Phase 4-6: Spatial-Temporal UI Understanding & Demonstration #3


After implementing the foundational OmniParser integration (Phases 1-3 in #2), this roadmap outlines the development of OmniMCP's spatial-temporal synthesis capabilities and high-impact demonstrations. The focus is on efficiently creating a compelling, usable framework with minimal complexity.

Phase 4: Core Process Graph & UI Understanding

Priority: High - Core differentiation
Dependencies: Completed Phase 3

4.1 Unified Data Source Architecture [Complexity: M]

- Create abstraction layer that works with both real and synthetic UI data
- Implement consistent interface for screen state acquisition and input control
- Support transparent switching between testing/production environments
- Add configuration-driven source selection with proper logging

This architecture creates a clean boundary between real and synthetic implementations:

class DataSourceManager:
    """Manages access to UI state and control with consistent interface."""
    
    def get_screen_state(self) -> ScreenState:
        """Get current UI state from configured source."""
        
    def perform_action(self, action: Action) -> ActionResult:
        """Execute action using configured input method."""

4.2 Process Graph Framework [Complexity: M]

- Implement data structures for representing UI interaction sequences
- Create process graph generation from synthetic UI scenarios
- Add state transition modeling and verification
- Implement graph visualization for debugging/demonstration

The process graph captures temporal patterns in UI interactions:

class ProcessGraph:
    """Represents UI automation sequences as directed graphs."""
    
    def add_state(self, screen_state: ScreenState) -> str:
        """Add state to the graph and return state ID."""
        
    def add_transition(self, source_id: str, target_id: str, action: Action) -> None:
        """Add transition between states based on action."""
        
    def find_similar_states(self, screen_state: ScreenState) -> List[SimilarState]:
        """Find states similar to given screen state."""
        
    def suggest_next_actions(self, current_state: ScreenState) -> List[ActionSuggestion]:
        """Suggest possible next actions based on graph."""

4.3 Core Omni API [Complexity: M]

- Implement fluent Omni API as specified in usage pattern
- Add session state management for multi-step operations
- Create natural language action execution
- Support conditional execution based on UI state

The core API follows the specified pattern for both programmatic and LLM use:

from omnimcp import Omni

omni = Omni()
with omni.session():
    email = omni.recall("credentials.email")

    if omni.is_("Login form ready"):  # is_ because `is` is a reserved word in Python
        omni.do(f"Enter {email}")
        omni.do("Submit login")

    omni.observe("latest transaction date").store("user.last_transaction_date")

This API encapsulates the spatial-temporal understanding machinery behind a simple automation interface:

class Omni:
    """Main entry point for OmniMCP framework."""

    def session(self):
        """Create a new interaction session context manager."""

    def recall(self, key: str) -> Any:
        """Recall value from session memory."""

    def store(self, key: str, value: Any) -> StoreResult:
        """Store value in session memory."""

    def is_(self, state_description: str) -> bool:
        """Check if current UI matches description."""

    def do(self, action_description: str) -> ActionResult:
        """Perform described action on UI."""

    def observe(self, target_description: str) -> ObservationResult:
        """Extract information from current UI state."""

Phase 5: MCP Protocol Implementation

Priority: High - API Stability
Dependencies: Completed Phase 4

5.1 MCP Core Protocol [Complexity: M]

- Implement streamlined MCP protocol for LLM interaction
- Create efficient JSON serialization/deserialization
- Add proper error handling and response types
- Support context window management

The MCP protocol provides LLMs with access to the same Omni API:

# Core MCP functions that map directly to the Omni API
@mcp.tool()
async def recall(key: str) -> Any:
    """Recall value from session memory."""

@mcp.tool()
async def is_(state_description: str) -> bool:
    """Check if current UI matches description."""

@mcp.tool()
async def do(action_description: str) -> ActionResult:
    """Perform described action on UI."""

@mcp.tool()
async def observe(target_description: str) -> ObservationResult:
    """Extract information from current UI state."""

@mcp.tool()
async def store(key: str, value: Any) -> StoreResult:
    """Store value in session memory."""

5.2 Process Understanding [Complexity: M]

- Implement UI state analysis for semantic understanding
- Add temporal pattern recognition from interactions
- Create context-aware element targeting
- Support spatial relationship understanding

These capabilities power the natural language interface:

class UIStateAnalyzer:
    """Analyzes UI state for semantic understanding."""
    
    def match_state_description(self, state_description: str) -> float:
        """Check how well current state matches description."""
        
    def find_elements(self, element_description: str) -> List[UIElement]:
        """Find elements matching description."""
        
    def extract_information(self, target_description: str) -> Any:
        """Extract described information from UI."""

5.3 Action Planning [Complexity: M]

- Implement multi-step action planning
- Add validation of proposed action sequences
- Create result verification for completed actions
- Support retry strategies for failed actions

This functionality enables goal-oriented automation:

@mcp.tool()
async def plan(goal_description: str) -> ActionPlan:
    """Create plan to achieve described goal on current UI."""

@mcp.tool()
async def execute_plan(plan: ActionPlan) -> PlanResult:
    """Execute action plan with validation steps."""

Phase 6: Demonstration & Documentation

Priority: Critical - Shows value
Dependencies: Completed Phase 5

6.1 Real-World Scenario Implementation [Complexity: M]

- Implement common UI automation scenarios (login, form filling, etc.)
- Create comprehensive examples showing framework capabilities
- Add detailed logging and visualization of execution
- Support both scripted and interactive demonstration

Example scenarios will showcase the framework in action:

def demonstrate_login_workflow():
    """Demonstrate login workflow with validation."""
    
    omni = Omni()
    with omni.session():
        # Store credentials in session memory
        omni.store("credentials.username", "testuser")
        omni.store("credentials.password", "password123")
        
        # Check if we're on login page
        if omni.is("login page"):
            # Enter credentials
            username = omni.recall("credentials.username")
            password = omni.recall("credentials.password")
            
            omni.do(f"Enter {username} in username field")
            omni.do(f"Enter {password} in password field")
            omni.do("Click login button")
            
            # Wait for dashboard to load
            omni.wait_for("dashboard page")
            
            # Extract and store account information
            omni.observe("account number").store("user.account_number")
            
            # Check balance
            balance = omni.observe("current balance")
            print(f"Current balance: {balance}")
        else:
            print("Not on login page")

6.2 Advanced Interaction Support [Complexity: M]

- Add support for more complex interactions (drag-and-drop, etc.)
- Implement handling for dynamic content and state changes
- Create strategies for error recovery
- Support batch operations across multiple UI elements

These capabilities will extend the core API:

# Extended Omni methods for advanced interactions
def wait_for(self, state_description: str, timeout: float = 10.0) -> bool:
    """Wait for UI to reach described state."""
    
def retry(self, action_description: str, max_attempts: int = 3) -> ActionResult:
    """Retry action until success or max attempts reached."""
    
def drag(self, source_desc: str, target_desc: str) -> ActionResult:
    """Drag element to target location."""

6.3 Documentation & Examples [Complexity: S]

- Create comprehensive API reference
- Develop clear usage guides with practical examples
- Document architecture and extension points
- Include troubleshooting and best practices

Implementation Strategy

The implementation will focus on simplicity and effectiveness:

  1. API-First Design: Center development around the intuitive Omni API
  2. Minimal Dependencies: Keep external dependencies to a minimum
  3. Test-Driven: Build comprehensive test suite to validate functionality
  4. Progressive Complexity: Start with core capabilities, then add features
  5. Real-World Testing: Continuously validate against real applications

Data Source Architecture

The data source abstraction provides a clean interface for both real and synthetic data:

# Factory method creates appropriate implementation based on configuration
def create_data_manager(config: Config) -> DataSourceManager:
    """Create data source manager based on configuration."""
    if config.USE_SYNTHETIC:
        return SyntheticDataManager(config)
    else:
        return RealDataManager(config)

# Real implementation uses actual screen capture and input
class RealDataManager(DataSourceManager):
    """Manages real UI interaction using mss and pynput."""
    
    def get_screen_state(self) -> ScreenState:
        """Capture real screen state using mss."""
        # Implementation uses mss for screen capture
        
    def perform_action(self, action: Action) -> ActionResult:
        """Execute action using pynput."""
        # Implementation uses pynput for input control

# Synthetic implementation for testing
class SyntheticDataManager(DataSourceManager):
    """Manages synthetic UI for testing."""
    
    def get_screen_state(self) -> ScreenState:
        """Generate synthetic screen state."""
        # Implementation creates synthetic UI representation
        
    def perform_action(self, action: Action) -> ActionResult:
        """Simulate action on synthetic UI."""
        # Implementation simulates action effects
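
A minimal Config consistent with the factory above; the exact field set is an assumption:

from dataclasses import dataclass

@dataclass
class Config:
    USE_SYNTHETIC: bool = False  # select SyntheticDataManager when True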

Process Graph Implementation

The process graph will be the central data structure for temporal understanding:

class ProcessGraphNode:
    """Represents a state in the process graph."""
    
    def __init__(self, state_id: str, screen_state: ScreenState):
        self.state_id = state_id
        self.screen_state = screen_state
        self.metadata = {}  # Additional state information

class ProcessGraphEdge:
    """Represents a transition between states in the process graph."""
    
    def __init__(self, 
                source_id: str, 
                target_id: str, 
                action: Action):
        self.source_id = source_id
        self.target_id = target_id
        self.action = action
        self.success_rate = 1.0  # optimistic initial estimate, refined as executions are observed
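
Edges could refine their success estimate as executions are observed; below is a sketch of a method that might be added to ProcessGraphEdge, using an exponential moving average (the update rule and smoothing factor are assumptions):

    def record_result(self, success: bool, alpha: float = 0.2) -> None:
        """Blend the latest outcome into the running success rate."""
        outcome = 1.0 if success else 0.0
        self.success_rate = (1 - alpha) * self.success_rate + alpha * outcome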

Next Steps

After completing Phases 1-3:

  1. Implement the DataSourceManager abstraction
  2. Create basic ProcessGraph implementation
  3. Implement the core Omni API as specified
  4. Build minimal MCP protocol implementation
  5. Develop demonstration scenarios

This streamlined approach focuses on creating a usable framework with minimal complexity while maintaining the powerful spatial-temporal understanding capabilities that differentiate OmniMCP.
