LLM Element Tracking System #8

Open
@abrichr

Description

We need to implement a lightweight element tracking system that leverages LLM reasoning to maintain UI element identity across frames. Rather than building complex confidence algorithms and recovery systems, this approach combines simple tracking data with LLM reasoning to determine whether elements have temporarily disappeared or have truly been removed from the UI. It follows the 80/20 principle: maximum reliability improvement for minimal implementation complexity.

Background

OmniMCP relies on OmniParser for UI element detection, which can occasionally miss elements due to rendering variations, visual artifacts, or transient UI states. The previously proposed confidence-based tracking and image enhancement system (#7) would require significant engineering effort. This approach instead leverages the reasoning capabilities of frontier LLMs to handle the hard decisions about element persistence.

Approach

  1. Implement a basic element tracker that maintains identity across frames
  2. Include tracking metadata in the context provided to the LLM
  3. Let the LLM explicitly analyze the state before deciding actions
  4. Use structured Pydantic models for the LLM's analysis and decisions
  5. Allow the LLM to reason about whether elements are missing or truly gone

Implementation Requirements

1. SimpleElementTracker Class

from typing import Any, Dict, List


class SimpleElementTracker:
    """Basic element tracking without complex confidence scoring."""

    def __init__(self, max_history: int = 5):
        self.tracked_elements: Dict[str, "ElementTrack"] = {}  # {element_id: ElementTrack}
        self.frame_count = 0
        self.max_history = max_history

    def update(self, current_detections: List[Dict[str, Any]]) -> Dict[str, "ElementTrack"]:
        """Update tracking with new detections.

        - Match current detections against tracked elements
        - Update tracking data (position history, consecutive misses)
        - Return tracking data for inclusion in the model context
        """
        self.frame_count += 1
        ...  # matching and bookkeeping to be implemented
        return self.tracked_elements
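The matching step inside `update()` can stay very simple, e.g. greedy nearest-center matching on bounding boxes. The sketch below is self-contained and illustrative only: the `TrackState` record, the `bbox` detection field, and the 50px distance threshold are assumptions, not part of the current OmniParser output contract.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class TrackState:
    """Minimal per-element tracking record (illustrative stand-in for ElementTrack)."""
    element_id: str
    last_bbox: Tuple[float, float, float, float]
    consecutive_misses: int = 0
    last_seen_frame: int = 0


def center(bbox: Tuple[float, float, float, float]) -> Tuple[float, float]:
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def match_detections(
    tracked: Dict[str, TrackState],
    detections: List[Dict[str, Any]],
    frame: int,
    max_dist: float = 50.0,
) -> Dict[str, TrackState]:
    """Greedy nearest-center matching between tracked elements and new detections."""
    unmatched = list(detections)
    for track in tracked.values():
        best, best_dist = None, max_dist
        cx, cy = center(track.last_bbox)
        for det in unmatched:
            dx, dy = center(det["bbox"])
            dist = ((cx - dx) ** 2 + (cy - dy) ** 2) ** 0.5
            if dist < best_dist:
                best, best_dist = det, dist
        if best is not None:
            # Element seen again: refresh position, reset the miss counter.
            unmatched.remove(best)
            track.last_bbox = best["bbox"]
            track.consecutive_misses = 0
            track.last_seen_frame = frame
        else:
            track.consecutive_misses += 1
    # Any leftover detections become new tracks.
    for i, det in enumerate(unmatched):
        eid = f"elem_{frame}_{i}"
        tracked[eid] = TrackState(eid, det["bbox"], 0, frame)
    return tracked
```

A production version would also fold in content similarity (text, element type) rather than position alone, but the miss-counting logic the LLM reasons over is the same.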

2. Structured Response Models

from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class ElementTrack(BaseModel):
    """Tracking information for a UI element across frames"""
    element_id: str
    current_detection: Optional[Dict] = None  # Current frame detection (None if not detected)
    previous_detections: List[Dict] = Field(default_factory=list)  # Last N detections
    consecutive_misses: int = 0
    last_seen_frame: int = 0


class ScreenAnalysis(BaseModel):
    """Model's analysis of the current UI state with tracking information"""
    reasoning: str = Field(description="Detailed reasoning about the UI state and tracked elements")
    disappeared_elements: List[str] = Field(default_factory=list)
    temporarily_missing_elements: List[str] = Field(default_factory=list)
    new_elements: List[str] = Field(default_factory=list)
    critical_elements: List[str] = Field(default_factory=list)


class ActionDecision(BaseModel):
    """Model's decision on what action to take"""
    analysis: ScreenAnalysis
    action_type: str
    target_element_id: Optional[str] = None
    parameters: Dict[str, Any] = Field(default_factory=dict)
    fallback_strategies: List[str] = Field(default_factory=list)
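Assuming Pydantic v2, a raw model response validates directly into these types via `model_validate_json`; the JSON payload below is a fabricated example, and the model definitions are repeated here only to keep the snippet self-contained.

```python
from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class ScreenAnalysis(BaseModel):
    reasoning: str
    disappeared_elements: List[str] = Field(default_factory=list)
    temporarily_missing_elements: List[str] = Field(default_factory=list)
    new_elements: List[str] = Field(default_factory=list)
    critical_elements: List[str] = Field(default_factory=list)


class ActionDecision(BaseModel):
    analysis: ScreenAnalysis
    action_type: str
    target_element_id: Optional[str] = None
    parameters: Dict[str, Any] = Field(default_factory=dict)
    fallback_strategies: List[str] = Field(default_factory=list)


# Fabricated example of what the LLM might return as JSON.
raw = """{
  "analysis": {
    "reasoning": "Submit button missed this frame; only 1 consecutive miss, likely transient.",
    "temporarily_missing_elements": ["submit_button"]
  },
  "action_type": "wait_and_retry",
  "target_element_id": "submit_button",
  "fallback_strategies": ["re-screenshot", "verify_element_presence"]
}"""

decision = ActionDecision.model_validate_json(raw)
```

Validation errors from malformed responses surface as `pydantic.ValidationError`, which gives a natural retry point for re-prompting the model.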

3. MCP Integration

class ModelDrivenVisualState(VisualState):
    """Visual state class that uses LLM reasoning for element persistence"""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)  # initialize the base visual state
        self.element_tracker = SimpleElementTracker()
        # Additional initialization

    async def update(self):
        # Get current detections
        # Update tracking data
        # Store tracking context for LLM
        ...

    async def decide_action(self, goal_description):
        # Build prompt with tracking context
        # Get structured response from model
        # Execute decision based on model's reasoning
        ...
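The tracking context handed to the model could be serialized along these lines. The helper name, the plain-dict track representation, and the summary fields are illustrative; the real version would summarize `ElementTrack` instances.

```python
import json
from typing import Any, Dict


def build_tracking_context(tracked_elements: Dict[str, Dict[str, Any]], frame_count: int) -> str:
    """Summarize tracking state as compact JSON for inclusion in the LLM prompt."""
    summary = []
    for element_id, track in tracked_elements.items():
        summary.append({
            "element_id": element_id,
            "visible_now": track["consecutive_misses"] == 0,
            "consecutive_misses": track["consecutive_misses"],
            "frames_since_seen": frame_count - track["last_seen_frame"],
        })
    return json.dumps({"frame": frame_count, "elements": summary}, indent=2)
```

Keeping the summary to derived facts (visibility, miss counts, recency) rather than raw detection history keeps the token cost low while giving the model what it needs to reason about persistence.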

4. Prompt Design

Design a system prompt that instructs the model to:

  • Analyze the tracking data to determine if elements are truly gone or just missed detections
  • Consider the goal and UI context in its reasoning
  • Provide structured analysis before making action decisions
  • Include fallback strategies when elements are uncertain
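A starting point for such a prompt might look like the following; the exact wording is hypothetical and will need iteration against real traces.

```python
# Hypothetical system prompt covering the four instructions above.
SYSTEM_PROMPT = """You are controlling a UI through detected elements.

For each step you receive the current detections plus tracking metadata:
each element's consecutive_misses count and frames since it was last seen.

Before choosing an action:
1. Decide for each tracked element whether it is temporarily missing
   (few consecutive misses, surrounding UI unchanged) or truly gone
   (many misses, or the UI has clearly navigated away).
2. Consider the user's goal and the surrounding UI context.
3. Respond with a ScreenAnalysis followed by an ActionDecision,
   including fallback strategies for any element you are uncertain about.
"""
```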

Acceptance Criteria

  • System successfully tracks elements across frames with simple position/content matching
  • LLM correctly identifies temporarily missing elements vs. truly disappeared elements in >80% of cases
  • Implementation adds minimal overhead (<30ms per frame, excluding LLM inference)
  • Reduces "element not found" errors by at least 70% compared to the current system
  • Makes automation significantly more robust without complex confidence algorithms
  • Provides clear reasoning in the analysis step for debugging purposes
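The <30ms overhead target can be checked with a small timing harness around the tracker update. The `noop_update` stand-in below is a placeholder; swap in `SimpleElementTracker.update` once implemented.

```python
import time
from typing import Any, Callable, Dict, List


def noop_update(detections: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stand-in for SimpleElementTracker.update; replace with the real tracker."""
    return {d["id"]: d for d in detections}


def mean_update_ms(update_fn: Callable, detections: List[Dict[str, Any]], iterations: int = 100) -> float:
    """Average wall-clock milliseconds per update call."""
    start = time.perf_counter()
    for _ in range(iterations):
        update_fn(detections)
    return (time.perf_counter() - start) * 1000 / iterations


# 50 synthetic detections approximates a busy screen.
detections = [{"id": f"elem_{i}", "bbox": (i, i, i + 10, i + 10)} for i in range(50)]
avg_ms = mean_update_ms(noop_update, detections)
```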

Implementation Priority

  1. Basic element tracking infrastructure
  2. Structured Pydantic models for LLM responses
  3. MCP integration with tracking context
  4. Prompt engineering for effective reasoning
  5. Testing across diverse UI patterns

Future Work (Phase 2)

In a future phase, we can enhance this system with:

  1. Image Enhancement Recovery Techniques:

    • Implement simple image enhancement methods (contrast adjustment, sharpening)
    • Let the LLM decide when to apply these techniques based on context
    • Include recovery results in the tracking data
  2. Hybrid Decision System:

    • Develop simple heuristics for common cases to reduce LLM calls
    • Use the LLM only for ambiguous or complex tracking situations
    • Cache common reasoning patterns for similar situations
  3. Performance Optimizations:

    • Batch LLM reasoning to reduce API calls
    • Compress tracking history for token efficiency
    • Develop lightweight embeddings for more efficient element matching
  4. VLM-Powered Element Verification:

    • Implement an MCP tool that uses vision-language models to verify element presence
    • Allow the LLM to request visual verification for critical missing elements
    • The tool would crop the region where an element was last seen and ask a VLM if the element is still present
    • This provides a high-confidence failsafe for critical automation steps without complex image processing
    • Example implementation:
      @tool("verify_element_presence")
      async def verify_element_presence(
          element_id: str,
          description: str
      ) -> Dict[str, Any]:
          """
          Verify if an element is present in the UI by cropping the region 
          and asking a vision model.
          
          Returns presence confirmation, confidence score, and analysis.
          """
    • This would be particularly valuable for critical UI elements where automation must not fail
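The cropping step of such a tool reduces to a padded bounding-box computation; a sketch, independent of the imaging library eventually used (the 20px default padding is an arbitrary assumption):

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def padded_crop_box(bbox: Box, screen_width: int, screen_height: int, padding: int = 20) -> Box:
    """Expand an element's last-seen bbox by `padding` px, clamped to the screen."""
    left, top, right, bottom = bbox
    return (
        max(0, left - padding),
        max(0, top - padding),
        min(screen_width, right + padding),
        min(screen_height, bottom + padding),
    )
```

The resulting box would be cropped from the latest screenshot and sent to the VLM together with the element description.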
