Natural Language Interface to Trigger Relevant Recordings #951

Open
@abrichr

Feature request

Enable users to launch OpenAdapt workflows using natural language commands like “do my taxes,” replacing the current manual replay model. This aligns with modern AI UX expectations and leverages existing recordings more effectively.

Problem

Requiring users to manually select and replay a Recording is unintuitive and limits accessibility. A natural language interface would allow users to describe tasks in plain English and let the system infer the most relevant automation.

Goal

Let users initiate task automation by typing a natural language description. The system finds relevant past demonstrations and uses them to guide replay or plan next steps adaptively.

Components

1. UI Input

  • Launch from system tray icon or similar.
  • Prompt: “What do you want help with today?”
  • Accepts a free-form natural language input.

2. Embedding Generation

  • Generate an embedding from user input using a model like sentence-transformers/all-MiniLM-L6-v2.
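
A minimal sketch of this step, assuming the optional `sentence-transformers` package is installed. The function names `embed_query` and `_load_model` are illustrative and not part of OpenAdapt; the model is loaded once and cached because loading it is much slower than encoding a short query.

```python
from functools import lru_cache
from typing import List

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
EMBEDDING_DIM = 384  # output dimension of all-MiniLM-L6-v2


@lru_cache(maxsize=1)
def _load_model():
    # Imported lazily so this module still loads when the optional
    # sentence-transformers dependency is missing.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(MODEL_NAME)


def embed_query(text: str) -> List[float]:
    """Return a unit-length embedding for a free-form user query."""
    vector = _load_model().encode(text, normalize_embeddings=True)
    return vector.tolist()
```

Normalizing the embedding means cosine similarity reduces to a dot product, which keeps the later search step simple.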

3. Embedding Search

  • Use sqlite-vss to find nearest matches among stored Recording.description embeddings.
  • Store embeddings in SQLite (RecordingEmbedding table or similar).
  • Leverage Recording.description as initial source of semantic content.
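
The storage and lookup described above could be prototyped as follows. This is a sketch using only the standard library: the table name `recording_embedding` and its columns are assumptions, and the brute-force cosine scan stands in for the sqlite-vss index, which would replace `nearest` once the number of recordings makes a full scan too slow.

```python
import math
import sqlite3
from array import array
from typing import List, Sequence, Tuple


def to_blob(vector: Sequence[float]) -> bytes:
    """Pack a vector as 32-bit floats for storage in a BLOB column."""
    return array("f", vector).tobytes()


def from_blob(blob: bytes) -> List[float]:
    values = array("f")
    values.frombytes(blob)
    return list(values)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one embedding per Recording.description.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recording_embedding ("
        " recording_id INTEGER PRIMARY KEY,"
        " description TEXT NOT NULL,"
        " embedding BLOB NOT NULL)"
    )


def add_recording(conn, recording_id: int, description: str, vector) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO recording_embedding VALUES (?, ?, ?)",
        (recording_id, description, to_blob(vector)),
    )


def nearest(conn, query_vector, top_n: int = 5) -> List[Tuple[int, str, float]]:
    """Return (recording_id, description, similarity), best match first."""
    rows = conn.execute(
        "SELECT recording_id, description, embedding FROM recording_embedding"
    )
    scored = [
        (rid, desc, cosine(query_vector, from_blob(blob)))
        for rid, desc, blob in rows
    ]
    scored.sort(key=lambda row: row[2], reverse=True)
    return scored[:top_n]
```

Keeping the embeddings in the same SQLite database as the recordings avoids introducing a separate vector store.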

4. Reranking & Demonstration Injection (instead of hard selection)

  • Instead of selecting a single best match or displaying a list:

    • Retrieve top-N semantically similar recordings.
    • Use reranking (e.g. with a cross-encoder) to sort them.
    • Inject the top few (or most relevant subsections) as demonstrations into the model prompt.
  • At each step, the model sees:

    • Current GUI state (screenshot + accessible DOM or bounding box data).
    • The user's original query.
    • Retrieved demonstrations (replay logs or summaries).
  • The model then decides what to do next.
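
The rerank-then-inject flow above can be sketched as below. Everything here is illustrative: `Demonstration`, `rerank`, and `build_prompt` are hypothetical names, and `token_overlap` is a deliberately crude placeholder scorer. In practice a cross-encoder would be passed in as `score`, taking the same (query, text) pair and returning a relevance value.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Demonstration:
    recording_id: int
    description: str
    summary: str


def token_overlap(query: str, text: str) -> float:
    """Placeholder relevance score: fraction of query words found in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: List[Demonstration],
    score: Callable[[str, str], float] = token_overlap,
    keep: int = 2,
) -> List[Demonstration]:
    """Sort retrieved candidates by relevance and keep the top few."""
    ranked = sorted(
        candidates,
        key=lambda demo: score(query, f"{demo.description} {demo.summary}"),
        reverse=True,
    )
    return ranked[:keep]


def build_prompt(query: str, gui_state: str, demos: List[Demonstration]) -> str:
    """Assemble the per-step prompt: query, current state, demonstrations."""
    lines = [
        f"User request: {query}",
        f"Current GUI state: {gui_state}",
        "Relevant demonstrations:",
    ]
    for index, demo in enumerate(demos, start=1):
        lines.append(f"{index}. {demo.description}: {demo.summary}")
    lines.append("Decide the next action.")
    return "\n".join(lines)
```

Because the prompt is rebuilt at every step, the demonstrations guide the model without forcing it to replay any single recording verbatim.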

5. Hierarchical / Recursive Summarization

  • For each recording, generate multi-resolution summaries:

    • High level: “file taxes”
    • Mid-level: “log in to TurboTax”
    • Low level: “click 'T4 Income' tab”
  • Use these summaries to:

    • Improve retrieval accuracy.
    • Enable context-aware planning and prompt construction.
    • Eventually support segmentation and partial replays.
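
One possible representation of the multi-resolution summaries is a small tree, flattened into (depth, text) rows so that each level can be embedded and retrieved separately. This is a sketch only: `SummaryNode` and `flatten` are hypothetical names, and the "enter income" node is an invented mid-level step added to make the example tree complete.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SummaryNode:
    text: str
    children: List["SummaryNode"] = field(default_factory=list)


def flatten(node: SummaryNode, depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, text) pairs in depth-first order; depth 0 is the top level."""
    yield depth, node.text
    for child in node.children:
        yield from flatten(child, depth + 1)


# Example tree built from the summaries listed above.
example = SummaryNode(
    "file taxes",
    [
        SummaryNode("log in to TurboTax"),
        SummaryNode("enter income", [SummaryNode("click 'T4 Income' tab")]),
    ],
)

for depth, text in flatten(example):
    print("  " * depth + text)
```

Storing the depth alongside each row would let retrieval match a broad query against high-level summaries and a narrow one against individual steps, which is also the hook for later segmentation and partial replays.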

UX Considerations

  • Lightweight, interruptible input box.
  • System should confirm before executing anything destructive.
  • If confidence is low, suggest clarifications or fallback paths.
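
The confirmation and low-confidence rules above amount to a small gating function. The sketch below is an assumption about how they might be combined: `NextStep`, `choose_next_step`, and the default threshold of 0.6 are all illustrative, and how confidence is computed and how an action is classified as destructive are left open.

```python
from enum import Enum


class NextStep(Enum):
    PROCEED = "proceed"
    CONFIRM = "confirm"
    CLARIFY = "clarify"


def choose_next_step(
    confidence: float, destructive: bool, threshold: float = 0.6
) -> NextStep:
    """Decide whether to act, ask for confirmation, or ask for clarification."""
    if confidence < threshold:
        # Low confidence: ask the user to clarify before doing anything.
        return NextStep.CLARIFY
    if destructive:
        # Confident, but the action cannot be undone: confirm first.
        return NextStep.CONFIRM
    return NextStep.PROCEED
```

Checking confidence before destructiveness means an uncertain match never reaches a confirmation dialog for an action the user may not have intended.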

Alternatives Considered

  • Keyword search: brittle and non-semantic.
  • Command line or dropdown replay: slower and less intuitive.
  • Static top-k display: less adaptive than demonstration-based inference.

Acceptance Criteria

  • User can enter a natural language query from a tray icon.
  • System embeds the query and retrieves similar past demos.
  • The model uses these demos to determine how to proceed.
  • Summaries and embeddings are auto-generated for future replays.

Labels

enhancement (New feature or request)